For details, please refer to the GitHub link.
The training of the AVSD module is consistent with that of Track 1.
The training of the AVSR module uses the oracle diarization results.
Weighted Prediction Error (WPE) dereverberation and BeamformIt are used to reduce reverberation in the speech signals. Both algorithms are applied with the open-source toolkits nara_wpe and BeamformIt.
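For illustration, a minimal WPE dereverberation sketch with nara_wpe is shown below; the input file name and STFT settings are assumptions for this example, not values taken from the recipe.

```python
# Minimal WPE dereverberation sketch with nara_wpe (illustrative settings).
import soundfile as sf
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

# Load a multi-channel recording: y has shape (channels, samples).
y, sample_rate = sf.read('session1_farfield.wav')  # hypothetical file name
y = y.T

stft_options = dict(size=512, shift=128)
# STFT -> (frequency, channel, frame) layout expected by wpe().
Y = stft(y, **stft_options).transpose(2, 0, 1)
Z = wpe(Y, taps=10, delay=3, iterations=5).transpose(1, 2, 0)
z = istft(Z, size=stft_options['size'], shift=stft_options['shift'])

sf.write('session1_farfield_wpe.wav', z.T, sample_rate)
```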
For the training, development, and test sets, we prepare the data directories and the lexicon in the format expected by Kaldi. Note that we use the raw DaCiDian resource and convert it to the Kaldi lexicon format.
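For illustration, a minimal sketch of writing the core Kaldi data files (wav.scp, text, utt2spk) is shown below; the directory name and the utterance record are placeholders rather than output of the released scripts.

```python
# Minimal sketch: write wav.scp, text, and utt2spk for a Kaldi data directory.
# The record below is a placeholder; the real recipe generates these files from MISP metadata.
import os

data_dir = 'data/train_far'  # hypothetical directory name
os.makedirs(data_dir, exist_ok=True)

utts = [
    # (utt_id, speaker_id, wav_path, transcript)
    ('S001_spk01_0001', 'S001_spk01', 'wav/S001_spk01_0001.wav', '今天 天气 很 好'),
]

with open(os.path.join(data_dir, 'wav.scp'), 'w', encoding='utf-8') as wav_scp, \
     open(os.path.join(data_dir, 'text'), 'w', encoding='utf-8') as text, \
     open(os.path.join(data_dir, 'utt2spk'), 'w', encoding='utf-8') as utt2spk:
    for utt_id, spk_id, wav_path, transcript in sorted(utts):
        wav_scp.write(f'{utt_id} {wav_path}\n')
        text.write(f'{utt_id} {transcript}\n')
        utt2spk.write(f'{utt_id} {spk_id}\n')
# spk2utt can then be generated with Kaldi's utils/utt2spk_to_spk2utt.pl.
```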
Language model
We segment the MISP speech transcriptions for language model training with DaCiDian as the dictionary and the Jieba open-source toolkit. For the language model, we choose a maximum entropy-based 3-gram model, which achieves the best perplexity among n-gram (n = 2, 3, 4) models trained on the MISP speech transcripts with different smoothing algorithms and parameter sets. The selected 3-gram model has 516600 unigrams, 432247 bigrams, and 915962 trigrams. Note that the temporary and final language models are stored in /data/srilm.
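For illustration, a minimal word-segmentation sketch with Jieba is shown below; the file names are placeholders, and the DaCiDian-derived user dictionary is assumed to contain one word per line.

```python
# Minimal sketch: segment transcripts for LM training with Jieba,
# using a DaCiDian-derived word list as the user dictionary.
# File names are placeholders.
import jieba

jieba.load_userdict('dacidian_words.txt')  # hypothetical word list, one word per line

with open('transcripts.txt', encoding='utf-8') as fin, \
     open('lm_train_text.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        utt_id, _, transcript = line.strip().partition(' ')
        words = jieba.cut(transcript, HMM=False)  # dictionary-driven segmentation
        fout.write(utt_id + ' ' + ' '.join(words) + '\n')
```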
Acoustic model
The acoustic model of the AVSR system is built largely following the Kaldi recipes, which mainly contain two stages: GMM-HMM model training and DNN-HMM model training.
For feature extraction, we extract 13-dimensional MFCC features plus 3-dimensional pitch features. As a starting point for the triphone models, a monophone model is trained on a subset of 50k utterances. Then a small triphone model and a larger triphone model are trained consecutively with delta features, on a subset of 100k utterances and on the whole dataset respectively. In the third triphone model training stage, an MLLT-based global transform is estimated iteratively on top of LDA features to extract speaker-independent features. For the fourth triphone model, feature-space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT) is applied.
Based on the tied triphone state alignments from the GMM, a DNN is configured and trained to replace the GMM. The input features are 40-dimensional FBank features with cepstral normalization and the 96 × 96 (W × H) lip ROI.
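For illustration, the sketch below shows one way the two input streams could be brought to a common frame rate before training; the 4x repetition factor (25 fps video against 100 fps audio frames) and the random arrays are assumptions for this example, not details of the baseline network.

```python
# Illustrative sketch: align the lip ROI stream to the FBank frame rate.
# The 4x repetition factor and the random arrays are assumptions for this example.
import numpy as np

fbank = np.random.randn(200, 40).astype(np.float32)       # (T_audio, 40) FBank frames
lip_roi = np.random.randn(50, 96, 96).astype(np.float32)  # (T_video, 96, 96) lip crops

# Repeat each video frame so both streams share the audio frame index.
lip_aligned = np.repeat(lip_roi, 4, axis=0)[:fbank.shape[0]]  # (T_audio, 96, 96)
assert lip_aligned.shape[0] == fbank.shape[0]
```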
The RTTM file produced by the AVSD module contains the Session, SPK, Tstart, and Tdur fields,
where:
Session is the recording session name, SPK is the speaker label, Tstart is the segment start time (in seconds), and Tdur is the segment duration (in seconds).
During inference, the RTTM file is used to segment the audio and video data in the AVSR module.
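For illustration, a minimal sketch of reading such an RTTM file into per-speaker segments is shown below; the column positions follow the standard RTTM layout, and the file name is a placeholder.

```python
# Minimal sketch: read speaker segments (session, speaker, start, duration)
# from an RTTM file. The file name is a placeholder.
segments = []
with open('avsd_output.rttm', encoding='utf-8') as f:
    for line in f:
        fields = line.split()
        if not fields or fields[0] != 'SPEAKER':
            continue
        session, spk = fields[1], fields[7]
        tstart, tdur = float(fields[3]), float(fields[4])
        segments.append((session, spk, tstart, tstart + tdur))

# Each tuple gives the session, speaker label, and segment boundaries in seconds,
# which can be used to cut the corresponding audio and lip-ROI streams.
for session, spk, start, end in segments:
    print(f'{session} {spk} {start:.2f}-{end:.2f}')
```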
If the speaker IDs of the final decoding results are global, you only need to calculate the CER. If you run your own model and obtain local speaker IDs, you need to calculate the cpCER.
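For illustration, the sketch below captures the cpCER idea: concatenate each speaker's transcripts, score every mapping between local and reference speaker IDs, and keep the permutation with the lowest character error rate. The helper functions and toy transcripts are assumptions for this example, not the official scoring script.

```python
# Illustrative cpCER sketch: try every mapping from hypothesis speakers to
# reference speakers and keep the permutation with the lowest character error rate.
# Assumes both sides have the same number of speakers (a simplification).
from itertools import permutations

def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cp_cer(ref_spk2text, hyp_spk2text):
    """ref_spk2text / hyp_spk2text: dict mapping speaker ID -> concatenated transcript."""
    ref_spks, hyp_spks = list(ref_spk2text), list(hyp_spk2text)
    total_ref_chars = sum(len(t) for t in ref_spk2text.values())
    best = float('inf')
    for perm in permutations(hyp_spks, len(ref_spks)):
        errors = sum(
            edit_distance(ref_spk2text[r], hyp_spk2text[h])
            for r, h in zip(ref_spks, perm)
        )
        best = min(best, errors)
    return best / total_ref_chars

# Toy usage: two local speaker IDs mapped onto two reference speakers.
ref = {'S001_spk01': '今天天气很好', 'S001_spk02': '我们出去玩吧'}
hyp = {'1': '我们出去玩', '2': '今天天气很好'}
print(cp_cer(ref, hyp))
```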
The AVSD+AVSR model is our AVDR baseline system.
# Setting local system jobs (local CPU - no external clusters)
export train_cmd=run.pl
export decode_cmd=run.pl
--- path.sh ---
# Defining Kaldi root directory
export KALDI_ROOT=
# Setting paths to useful tools
export PATH=
# Enable SRILM
. $KALDI_ROOT/tools/env.sh
# Variable needed for proper data sorting
export LC_ALL=C
--- run_misp.sh ---
# Defining corpus directory
misp2022_corpus=
# Defining path to BeamformIt executable file
bearmformit_path=
# Defining path to python interpreter
python_path=
# Directory hosting the coordinate information used to crop the ROI
data_roi=
# Dictionary directory
dict_dir=
./run.sh
# options:
--stage -1   change the number to start from a different training stage
Kaldi
Python packages: numpy, tqdm, jieba
Other tools: nara_wpe, BeamformIt, SRILM
If you find this code useful in your research, please consider citing the following papers:

@inproceedings{chen2022first,
  title={The first multimodal information based speech processing (MISP) challenge: Data, tasks, baselines and results},
  author={Chen, Hang and Zhou, Hengshun and Du, Jun and Lee, Chin-Hui and Chen, Jingdong and Watanabe, Shinji and Siniscalchi, Sabato Marco and Scharenborg, Odette and Liu, Di-Yuan and Yin, Bao-Cai and others},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={9266--9270},
  year={2022},
  organization={IEEE}
}

@inproceedings{2022misptask2,
  author={Chen, Hang and Du, Jun and Dai, Yusheng and Lee, Chin-Hui and Siniscalchi, Sabato Marco and Watanabe, Shinji and Scharenborg, Odette and Chen, Jingdong and Yin, Bao-Cai and Pan, Jia},
  booktitle={Proc. INTERSPEECH 2022},
  title={Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis},
  year={2022}
}