For details, please refer to the GitHub link.
The training of the AVSD module is consistent with that of Track 1.
The training of the AVSR module uses the oracle diarization results.
Weighted Prediction Error (WPE) dereverberation and BeamformIt are used to reduce reverberation in the speech signals. Both algorithms are applied with the open-source toolkits nara_wpe and BeamformIt.
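For illustration, a minimal WPE dereverberation sketch with nara_wpe is shown below; the input file name and STFT settings are assumptions for this example, not values taken from the recipe.

```python
# Minimal WPE dereverberation sketch with nara_wpe (illustrative settings).
import soundfile as sf
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

# Load a multi-channel recording: y has shape (channels, samples).
y, sample_rate = sf.read('session1_farfield.wav')  # hypothetical file name
y = y.T

stft_options = dict(size=512, shift=128)
# STFT -> (frequency, channel, frame) layout expected by wpe().
Y = stft(y, **stft_options).transpose(2, 0, 1)
Z = wpe(Y, taps=10, delay=3, iterations=5).transpose(1, 2, 0)
z = istft(Z, size=stft_options['size'], shift=stft_options['shift'])

sf.write('session1_farfield_wpe.wav', z.T, sample_rate)
```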
For the training, development, and test sets, we prepare the data directories and the lexicon in the format expected by Kaldi. Note that we use the raw DaCiDian resource and convert it to the Kaldi lexicon format.
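For illustration, a minimal sketch of writing the core Kaldi data files (wav.scp, text, utt2spk) is shown below; the directory name and the utterance record are placeholders rather than output of the released scripts.

```python
# Minimal sketch: write wav.scp, text, and utt2spk for a Kaldi data directory.
# The record below is a placeholder; the real recipe generates these files from MISP metadata.
import os

data_dir = 'data/train_far'  # hypothetical directory name
os.makedirs(data_dir, exist_ok=True)

utts = [
    # (utt_id, speaker_id, wav_path, transcript)
    ('S001_spk01_0001', 'S001_spk01', 'wav/S001_spk01_0001.wav', '今天 天气 很 好'),
]

with open(os.path.join(data_dir, 'wav.scp'), 'w', encoding='utf-8') as wav_scp, \
     open(os.path.join(data_dir, 'text'), 'w', encoding='utf-8') as text, \
     open(os.path.join(data_dir, 'utt2spk'), 'w', encoding='utf-8') as utt2spk:
    for utt_id, spk_id, wav_path, transcript in sorted(utts):
        wav_scp.write(f'{utt_id} {wav_path}\n')
        text.write(f'{utt_id} {transcript}\n')
        utt2spk.write(f'{utt_id} {spk_id}\n')
# spk2utt can then be generated with Kaldi's utils/utt2spk_to_spk2utt.pl.
```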
Language model
We segment the MISP speech transcriptions for language model training with DaCiDian as the dictionary and the Jieba open-source toolkit. For the language model, we choose a maximum entropy-based 3-gram model, which achieves the best perplexity among n-gram (n = 2, 3, 4) models trained on the MISP speech transcripts with different smoothing algorithms and parameter sets. The selected 3-gram model has 516600 unigrams, 432247 bigrams, and 915962 trigrams. Note that the temporary and final language models are stored in /data/srilm.
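For illustration, a minimal word-segmentation sketch with Jieba is shown below; the file names are placeholders, and the DaCiDian-derived user dictionary is assumed to contain one word per line.

```python
# Minimal sketch: segment transcripts for LM training with Jieba,
# using a DaCiDian-derived word list as the user dictionary.
# File names are placeholders.
import jieba

jieba.load_userdict('dacidian_words.txt')  # hypothetical word list, one word per line

with open('transcripts.txt', encoding='utf-8') as fin, \
     open('lm_train_text.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        utt_id, _, transcript = line.strip().partition(' ')
        words = jieba.cut(transcript, HMM=False)  # dictionary-driven segmentation
        fout.write(utt_id + ' ' + ' '.join(words) + '\n')
```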
Acoustic model
The acoustic model of the AVSR system is built largely following the Kaldi recipes, which mainly contain two stages: GMM-HMM model training and DNN-HMM model training.
For feature extraction, we extract 13-dimensional MFCC features plus 3-dimensional pitch features. As a starting point for the triphone models, a monophone model is trained on a subset of 50k utterances. Then a small triphone model and a larger triphone model are trained consecutively with delta features, on a subset of 100k utterances and on the whole dataset respectively. In the third triphone model training stage, an MLLT-based global transform is estimated iteratively on top of LDA features to extract speaker-independent features. For the fourth triphone model, feature-space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT) is applied.
Based on the tied triphone state alignments from the GMM, a DNN is configured and trained to replace the GMM. The input features are 40-dimensional FBank features with cepstral normalization and the 96 × 96 (W × H) lip ROI.
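For illustration, the sketch below shows one way the two input streams could be brought to a common frame rate before training; the 4x repetition factor (25 fps video against 100 fps audio frames) and the random arrays are assumptions for this example, not details of the baseline network.

```python
# Illustrative sketch: align the lip ROI stream to the FBank frame rate.
# The 4x repetition factor and the random arrays are assumptions for this example.
import numpy as np

fbank = np.random.randn(200, 40).astype(np.float32)       # (T_audio, 40) FBank frames
lip_roi = np.random.randn(50, 96, 96).astype(np.float32)  # (T_video, 96, 96) lip crops

# Repeat each video frame so both streams share the audio frame index.
lip_aligned = np.repeat(lip_roi, 4, axis=0)[:fbank.shape[0]]  # (T_audio, 96, 96)
assert lip_aligned.shape[0] == fbank.shape[0]
```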
The RTTM file produced by the AVSD module contains the Session, SPK, Tstart, and Tdur fields,
where:
Session is the recording session name, SPK is the speaker label, Tstart is the segment start time (in seconds), and Tdur is the segment duration (in seconds).
During inference, the RTTM file is used to segment the audio and video data in the AVSR module.
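For illustration, a minimal sketch of reading such an RTTM file into per-speaker segments is shown below; the column positions follow the standard RTTM layout, and the file name is a placeholder.

```python
# Minimal sketch: read speaker segments (session, speaker, start, duration)
# from an RTTM file. The file name is a placeholder.
segments = []
with open('avsd_output.rttm', encoding='utf-8') as f:
    for line in f:
        fields = line.split()
        if not fields or fields[0] != 'SPEAKER':
            continue
        session, spk = fields[1], fields[7]
        tstart, tdur = float(fields[3]), float(fields[4])
        segments.append((session, spk, tstart, tstart + tdur))

# Each tuple gives the session, speaker label, and segment boundaries in seconds,
# which can be used to cut the corresponding audio and lip-ROI streams.
for session, spk, start, end in segments:
    print(f'{session} {spk} {start:.2f}-{end:.2f}')
```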
If the speaker IDs of the final decoding results are global, you only need to calculate the CER. If you run your own model and obtain local speaker IDs, you need to calculate the cpCER.
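For illustration, the sketch below captures the cpCER idea: concatenate each speaker's transcripts, score every mapping between local and reference speaker IDs, and keep the permutation with the lowest character error rate. The helper functions and toy transcripts are assumptions for this example, not the official scoring script.

```python
# Illustrative cpCER sketch: try every mapping from hypothesis speakers to
# reference speakers and keep the permutation with the lowest character error rate.
# Assumes both sides have the same number of speakers (a simplification).
from itertools import permutations

def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cp_cer(ref_spk2text, hyp_spk2text):
    """ref_spk2text / hyp_spk2text: dict mapping speaker ID -> concatenated transcript."""
    ref_spks, hyp_spks = list(ref_spk2text), list(hyp_spk2text)
    total_ref_chars = sum(len(t) for t in ref_spk2text.values())
    best = float('inf')
    for perm in permutations(hyp_spks, len(ref_spks)):
        errors = sum(
            edit_distance(ref_spk2text[r], hyp_spk2text[h])
            for r, h in zip(ref_spks, perm)
        )
        best = min(best, errors)
    return best / total_ref_chars

# Toy usage: two local speaker IDs mapped onto two reference speakers.
ref = {'S001_spk01': '今天天气很好', 'S001_spk02': '我们出去玩吧'}
hyp = {'1': '我们出去玩', '2': '今天天气很好'}
print(cp_cer(ref, hyp))
```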
The AVSD+AVSR model is our AVDR baseline system.
# Setting local system jobs (local CPU - no external clusters)
export train_cmd=run.pl
export decode_cmd=run.pl
--- path.sh ---
# Defining Kaldi root directory
export KALDI_ROOT=
# Setting paths to useful tools
export PATH=
# Enable SRILM
. $KALDI_ROOT/tools/env.sh
# Variable needed for proper data sorting
export LC_ALL=C
--- run_misp.sh ---
# Defining corpus directory
misp2022_corpus=
# Defining path to BeamformIt executable file
bearmformit_path=
# Defining path to python interpreter
python_path=
# Directory hosting the coordinate information used to crop the ROI
data_roi=
# Dictionary directory
dict_dir=
./run.sh
# options:
--stage -1   change the number to start from a different training stage
Kaldi
Python packages: numpy, tqdm, jieba
Other tools: nara_wpe, BeamformIt, SRILM
If you find this code useful in your research, please consider citing the following papers:

@inproceedings{chen2022first,
  title={The first multimodal information based speech processing (MISP) challenge: Data, tasks, baselines and results},
  author={Chen, Hang and Zhou, Hengshun and Du, Jun and Lee, Chin-Hui and Chen, Jingdong and Watanabe, Shinji and Siniscalchi, Sabato Marco and Scharenborg, Odette and Liu, Di-Yuan and Yin, Bao-Cai and others},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={9266--9270},
  year={2022},
  organization={IEEE}
}

@inproceedings{2022misptask2,
  author={Chen, Hang and Du, Jun and Dai, Yusheng and Lee, Chin-Hui and Siniscalchi, Sabato Marco and Watanabe, Shinji and Scharenborg, Odette and Chen, Jingdong and Yin, Bao-Cai and Pan, Jia},
  booktitle={Proc. INTERSPEECH 2022},
  title={Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis},
  year={2022}
}