Track 2 AVDR Baseline

For details, please refer to the GitHub repository.

AVSD Module Training

The training of the AVSD module is the same as in Track 1.

AVSR Module Training

The AVSR module is trained using the oracle diarization results.

  1. Data preparation
    • Speech enhancement
    • Weighted Prediction Error (WPE) dereverberation and BeamformIt are used to reduce reverberation in the speech signals. The algorithms are implemented with the open-source toolkits nara_wpe and BeamformIt (a dereverberation sketch follows this item).

    • Prepare data and language directories for Kaldi
    • For the training, development, and test sets, we prepare the data directories and the lexicon in the format expected by Kaldi. Note that we use the raw DaCiDian resource and convert it to the Kaldi lexicon format.
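
A minimal dereverberation sketch using the nara_wpe Python package is shown below. The file names and the STFT/WPE parameters (taps, delay, iterations) are illustrative assumptions rather than the exact settings of the baseline scripts; BeamformIt is applied to the dereverberated channels separately via its command-line tool.

# Sketch: WPE dereverberation of a multi-channel far-field recording with nara_wpe.
# File names and parameters are illustrative, not the baseline's exact settings.
import soundfile as sf
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

y, sr = sf.read("far_session.wav")               # (samples, channels)
y = y.T                                          # (channels, samples)

stft_options = dict(size=512, shift=128)
Y = stft(y, **stft_options).transpose(2, 0, 1)   # (freq_bins, channels, frames)

# Assumed taps/delay/iterations; tune to match the baseline configuration.
Z = wpe(Y, taps=10, delay=3, iterations=5).transpose(1, 2, 0)
z = istft(Z, size=stft_options["size"], shift=stft_options["shift"])

sf.write("far_session_wpe.wav", z.T, sr)         # dereverberated multi-channel output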

  2. Language model

    We segment the MISP speech transcripts for language model training using DaCiDian as the dictionary and the open-source Jieba toolkit (see the segmentation sketch below). For the language model, we choose a maximum-entropy-based 3-gram model, which achieves the best perplexity among the n-gram (n = 2, 3, 4) models trained on the MISP speech transcripts with different smoothing algorithms and parameter sets. The selected 3-gram model has 516,600 unigrams, 432,247 bigrams, and 915,962 trigrams. Note that the temporary and final language models are stored in /data/srilm.
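
A minimal sketch of the segmentation step is shown below, assuming a Kaldi-style text file (utterance ID followed by the transcript) and a DaCiDian-derived word list usable as a Jieba user dictionary; the file names are placeholders. The n-gram models themselves are then trained with SRILM and stored under /data/srilm.

# Sketch: segment MISP transcripts with Jieba before n-gram LM training.
# "text" is a Kaldi-style transcript file; "dacidian_words.txt" is an assumed
# word list derived from DaCiDian (one word per line).
import jieba

jieba.load_userdict("dacidian_words.txt")        # bias segmentation towards DaCiDian words

with open("text", encoding="utf-8") as fin, \
     open("text.seg", "w", encoding="utf-8") as fout:
    for line in fin:
        parts = line.strip().split(maxsplit=1)
        if len(parts) < 2:
            continue                             # skip utterances without a transcript
        utt_id, transcript = parts
        words = jieba.cut(transcript, HMM=False) # dictionary-driven segmentation
        fout.write(f"{utt_id} {' '.join(words)}\n")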

  3. Acoustic model

    The acoustic model of the AVSR system is built largely following the Kaldi recipes, which mainly contain two stages: GMM-HMM training and DNN-HMM training.

    • GMM-HMM
    • For feature extraction, we extract 13-dimensional MFCC features plus 3-dimensional pitch features. As a starting point for the triphone models, a monophone model is trained on a subset of 50k utterances. Then a small triphone model and a larger triphone model are trained consecutively using delta features, on a subset of 100k utterances and on the whole dataset respectively. For the third triphone model, an MLLT-based global transform is estimated iteratively on top of LDA features to obtain speaker-independent features. For the fourth triphone model, feature-space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT) is applied.

    • DNN-HMM
    • Based on the tied-triphone state alignments from the GMM, a DNN is configured and trained to replace the GMM. The input features are 40-dimensional FBank features with cepstral mean normalization, together with the 96 × 96 (W × H) lip ROI (see the feature sketch after this list).
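
As a rough illustration of how the DNN input can be assembled, the sketch below computes 40-dimensional FBank features with utterance-level cepstral mean normalization and crops a 96 × 96 lip ROI from the video frames. torchaudio and OpenCV are assumed tooling, and the ROI box is a placeholder for the coordinates shipped in the data_roi directory; the baseline's actual feature pipeline is implemented in the recipe scripts.

# Sketch: assemble the audio-visual DNN input -- 40-dim FBank + 96x96 lip ROI.
# torchaudio/OpenCV are assumed tooling; paths and the ROI box are placeholders.
import cv2
import numpy as np
import torch
import torchaudio

# 40-dimensional FBank features with per-utterance cepstral mean normalization.
waveform, sr = torchaudio.load("utt.wav")
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, sample_frequency=sr, num_mel_bins=40
)                                            # (frames, 40)
fbank = fbank - fbank.mean(dim=0, keepdim=True)

# 96x96 lip ROI cropped from the corresponding video frames.
rois = []
cap = cv2.VideoCapture("utt.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x, y, w, h = 100, 200, 96, 96            # placeholder box from the data_roi files
    roi = frame[y:y + h, x:x + w]
    rois.append(cv2.cvtColor(cv2.resize(roi, (96, 96)), cv2.COLOR_BGR2GRAY))
cap.release()
lip_rois = torch.from_numpy(np.stack(rois))  # (video_frames, 96, 96), fused with fbank downstream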

Inference

The RTTM file output by the AVSD module contains, for each utterance, the fields Session_k, SPK_i, T_start_j, and T_dur_j,

where:

  • Session_k: the k-th session
  • SPK_i: the i-th speaker
  • T_start_j: the start time of the j-th utterance of SPK_i
  • T_dur_j: the duration of the j-th utterance of SPK_i

During inference, the RTTM file is used to segment the audio and video data for the AVSR module.
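
A small parsing sketch is shown below. It assumes the standard RTTM field layout (SPEAKER <session> <channel> <t_start> <t_dur> <NA> <NA> <speaker> <NA> <NA>) and a hypothetical file name, and collects the per-speaker segments that drive the audio/video cutting.

# Sketch: parse an RTTM file from the AVSD module into per-speaker segments.
from collections import defaultdict

def load_rttm(path):
    """Return {(session, speaker): [(t_start, t_end), ...]} parsed from an RTTM file."""
    segments = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            session = fields[1]
            t_start, t_dur = float(fields[3]), float(fields[4])
            speaker = fields[7]
            segments[(session, speaker)].append((t_start, t_start + t_dur))
    return segments

# Example: print the utterance boundaries used to cut audio/video per speaker.
for (session, spk), times in load_rttm("avsd_output.rttm").items():   # hypothetical file name
    for t_start, t_end in times:
        print(f"{session} {spk} {t_start:.2f} {t_end:.2f}")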

If the speaker IDs of the final decoding results are global, you only need to calculate the CER. If you run your own model and obtain local speaker IDs, you need to calculate the cpCER.
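
For reference, the following is a plain-Python sketch of the cpCER computation, not the official scoring script: the per-speaker reference and hypothesis transcripts are concatenated, every mapping between local and reference speaker IDs is tried, and the minimum character error rate over all permutations is reported.

# Sketch: concatenated minimum-permutation CER (cpCER), plain-Python illustration.
from itertools import permutations

def edit_distance(ref, hyp):
    """Character-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]

def cp_cer(ref_by_spk, hyp_by_spk):
    """Minimum CER over all mappings between hypothesis and reference speakers."""
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))                   # pad so speaker counts match
    hyps += [""] * (n - len(hyps))
    total_ref_chars = sum(len(r) for r in refs)
    # Note: the permutation search grows factorially with the number of speakers.
    best = min(sum(edit_distance(r, h) for r, h in zip(refs, perm))
               for perm in permutations(hyps))
    return best / total_ref_chars

# Toy example: local hypothesis speaker IDs (A, B) do not match reference IDs (S1, S2).
ref = {"S1": "你好世界", "S2": "今天天气"}
hyp = {"A": "今天天汽", "B": "你好世界"}
print(f"cpCER = {cp_cer(ref, hyp):.2%}")             # best mapping: A->S2, B->S1 -> 1/8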

Results

The AVSD+AVSR model is our AVDR baseline system.

Quick start

  • Setting Local System Jobs

# Setting local system jobs (local CPU - no external clusters)
export train_cmd=run.pl
export decode_cmd=run.pl

  • Setting Paths

--- path.sh ---
# Defining Kaldi root directory
export KALDI_ROOT=
# Setting paths to useful tools
export PATH=
# Enable SRILM
. $KALDI_ROOT/tools/env.sh
# Variable needed for proper data sorting
export LC_ALL=C

--- run_misp.sh ---
# Defining corpus directory
misp2022_corpus=
# Defining path to BeamformIt executable file
beamformit_path=
# Defining path to python interpreter
python_path=
# Defining the directory hosting the coordinate information used to crop the lip ROI
data_roi=
# dictionary directory
dict_dir=

  • Running Training

./run.sh
# options:
#   --stage  -1    # change the number to start from a different training stage

Requirements

Citation

If you find this code useful in your research, please consider citing the following papers:
@inproceedings{chen2022first,
  title={The First Multimodal Information Based Speech Processing ({MISP}) Challenge: Data, Tasks, Baselines and Results},
  author={Chen, Hang and Zhou, Hengshun and Du, Jun and Lee, Chin-Hui and Chen, Jingdong and Watanabe, Shinji and Siniscalchi, Sabato Marco and Scharenborg, Odette and Liu, Di-Yuan and Yin, Bao-Cai and others},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={9266--9270},
  year={2022},
  organization={IEEE}
}
@inproceedings{2022misptask2,
  title={Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis},
  author={Chen, Hang and Du, Jun and Dai, Yusheng and Lee, Chin-Hui and Siniscalchi, Sabato Marco and Watanabe, Shinji and Scharenborg, Odette and Chen, Jingdong and Yin, Bao-Cai and Pan, Jia},
  booktitle={Proc. INTERSPEECH 2022},
  year={2022}
}