Task 2 NN-HMM Based AVSR Baseline

For details, please refer to the GitHub repository.

  • Data preparation

    • speech enhancement

    We provide two baseline speech enhancement front-ends, Weighted Prediction Error (WPE) dereverberation and weighted delay-and-sum (DAS) beamforming, to reduce the reverberation and noise in the speech signals. The two algorithms are implemented with the open-source toolkits nara_wpe and BeamformIt, respectively.
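
    As an illustration, the sketch below applies nara_wpe's offline WPE to a multi-channel recording; the input/output file names and STFT settings are assumptions, not the recipe's exact values, and BeamformIt is driven separately through its own configuration files.

python3 - <<'EOF'
# Minimal WPE dereverberation sketch with nara_wpe; file names are illustrative.
import soundfile as sf
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

y, sr = sf.read('far_6ch.wav')        # hypothetical multi-channel input
y = y.T                               # -> (channels, samples)
Y = stft(y, size=512, shift=128)      # -> (channels, frames, freq_bins)
Z = wpe(Y.transpose(2, 0, 1), taps=10, delay=3, iterations=3)
z = istft(Z.transpose(1, 2, 0), size=512, shift=128)
sf.write('far_6ch_wpe.wav', z.T, sr)  # dereverberated channels
EOF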

    • prepare data and language directory for kaldi

    For the training, development, and test sets, we prepare the data directories and the lexicon in the format expected by Kaldi. Note that we take the raw DaCiDian resource and convert it into the Kaldi lexicon format.
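
    A minimal sketch of this step with standard Kaldi utilities is shown below; the directory names and the OOV symbol "<UNK>" are assumptions following common Kaldi conventions.

# Sketch of the data/lang preparation; directory names are illustrative.
# Each set gets wav.scp, text, utt2spk (and segments) under data/<set>, e.g.:
utils/utt2spk_to_spk2utt.pl data/train_far/utt2spk > data/train_far/spk2utt
utils/validate_data_dir.sh --no-feats data/train_far
# Build the lang directory from the DaCiDian-derived lexicon in data/local/dict
# (lexicon.txt, silence_phones.txt, nonsilence_phones.txt, optional_silence.txt):
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang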

  • Language model

    We segment the MISP speech transcriptions for language model training, using DaCiDian as the dictionary and the open-source Jieba toolkit as the segmenter. For the language model, we choose a maximum entropy-based 3-gram model, which achieves the best perplexity among the n-gram (n = 2, 3, 4) models trained on the MISP speech transcripts with different smoothing algorithms and parameter sets. The selected 3-gram model has 516,600 unigrams, 432,247 bigrams, and 915,962 trigrams. Note that the temporary and final language models are stored in /data/srilm.
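
    As a rough sketch of this pipeline, the commands below segment the transcripts with Jieba's command-line mode and train/score one candidate 3-gram with SRILM; the file names are placeholders, only one smoothing setting is shown, and the selected maximum-entropy model is trained analogously with SRILM's maxent support.

# Word segmentation with Jieba's command-line interface (space as delimiter):
python -m jieba -d ' ' text.raw > text.seg
# Train one candidate 3-gram (here interpolated Kneser-Ney) and check its
# perplexity on the development transcripts; repeat over n and smoothing
# settings to pick the best model:
ngram-count -order 3 -text text.seg -lm srilm/3gram.kn.gz -kndiscount -interpolate
ngram -order 3 -lm srilm/3gram.kn.gz -ppl dev.seg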

  • Acoustic model

    The acoustic model of the ASR system is built largely following the Kaldi CHiME-6 recipe, which mainly contains two stages: GMM-HMM modeling and TDNN deep learning modeling.

    • GMM-HMM

      For feature extraction, we extract 13-dimensional MFCC features plus 3-dimensional pitch features. As a starting point for the triphone models, a monophone model is trained on a subset of 50k utterances. Then a small triphone model and a larger triphone model are trained consecutively using delta features, on a subset of 100k utterances and on the whole dataset, respectively. In the third triphone training pass, an MLLT-based global transform is estimated iteratively on top of LDA features to obtain speaker-independent features. For the fourth triphone model, feature space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT) is applied, as sketched below.
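
      The four passes condense into the standard Kaldi commands below; the subset names, leaf/Gaussian counts, and experiment directories are illustrative assumptions rather than the recipe's exact settings.

# Sketch of the GMM-HMM pipeline; directory names and model sizes are illustrative.
steps/make_mfcc_pitch.sh --nj 20 data/train_far exp/make_mfcc/train_far mfcc
steps/compute_cmvn_stats.sh data/train_far
utils/subset_data_dir.sh data/train_far 50000 data/train_50k
utils/subset_data_dir.sh data/train_far 100000 data/train_100k
# Monophone start, then two delta-feature triphone passes:
steps/train_mono.sh --nj 20 data/train_50k data/lang exp/mono
steps/align_si.sh --nj 20 data/train_100k data/lang exp/mono exp/mono_ali
steps/train_deltas.sh 2500 30000 data/train_100k data/lang exp/mono_ali exp/tri1
steps/align_si.sh --nj 20 data/train_far data/lang exp/tri1 exp/tri1_ali
steps/train_deltas.sh 4000 60000 data/train_far data/lang exp/tri1_ali exp/tri2
# Third pass: LDA features with an iteratively re-estimated global MLLT transform:
steps/align_si.sh --nj 20 data/train_far data/lang exp/tri2 exp/tri2_ali
steps/train_lda_mllt.sh 5000 90000 data/train_far data/lang exp/tri2_ali exp/tri3
# Fourth pass: speaker adaptive training with fMLLR transforms:
steps/align_fmllr.sh --nj 20 data/train_far data/lang exp/tri3 exp/tri3_ali
steps/train_sat.sh 5000 100000 data/train_far data/lang exp/tri3_ali exp/tri4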

    • NN-HMM

      Based on the tied-triphone state alignments from the GMM, a TDNN is configured and trained to replace the GMM. Two signal-level data augmentation techniques, speed perturbation and volume perturbation, are applied. The input features are 40-dimensional high-resolution MFCC features with cepstral mean normalization. Note that a 100-dimensional i-vector is also attached to each frame-wise input; its extractor is trained on the augmented corpus. An advanced time-delay neural network (TDNN) baseline using lattice-free maximum mutual information (LF-MMI) training and other strategies is adopted in the system; consult the paper and the documentation for more details.
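
      The input preparation for this stage can be sketched with standard Kaldi utilities as below; the directory names are illustrative, and the LF-MMI chain training itself is launched by the recipe's TDNN script (see the repository).

# Sketch of the chain-TDNN input preparation; directory names are illustrative.
# 3-way speed perturbation (0.9/1.0/1.1), then volume perturbation in place:
utils/data/perturb_data_dir_speed_3way.sh data/train_far data/train_far_sp
utils/copy_data_dir.sh data/train_far_sp data/train_far_sp_hires
utils/data/perturb_data_dir_volume.sh data/train_far_sp_hires
# 40-dimensional high-resolution MFCCs:
steps/make_mfcc.sh --nj 20 --mfcc-config conf/mfcc_hires.conf data/train_far_sp_hires
steps/compute_cmvn_stats.sh data/train_far_sp_hires
# 100-dimensional online i-vectors, given an extractor trained on this corpus:
steps/online/nnet2/extract_ivectors_online.sh --nj 20 \
  data/train_far_sp_hires exp/nnet3/extractor exp/nnet3/ivectors_train_far_sp_hires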

  • Audio-Visual Speech Recognition

    Based on the NN-HMM hybrid ASR system described above, we try to enhance the system with lip reading. To obtain visual embeddings, we first crop mouth ROIs from the video streams, then use the lip-reading TCN to extract 512-dimensional features. Still using the TDNN, we simply concatenate the embeddings of the two modalities (512 + 40 + 100) at the training stage, as sketched below. Preliminary results show that the visual information does help ASR, and there is still considerable room for improvement.
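
    To make the fusion concrete, here is a minimal sketch with Kaldi's paste-feats; the archive names are assumptions, and the i-vectors are appended by the chain training scripts as in the audio-only system.

# Append the 512-dim visual embeddings to the 40-dim hires MFCCs frame by frame.
# Archive names are illustrative; the two streams must share the same frame rate
# (small length mismatches are absorbed by --length-tolerance).
paste-feats --length-tolerance=2 \
  ark:mfcc_hires.ark ark:visual_tcn_512.ark \
  ark,scp:av_feats.ark,av_feats.scp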

Quick start

  • Setting Local System Jobs

# Setting local system jobs (local CPU - no external clusters)
export train_cmd=run.pl
export decode_cmd=run.pl

  • Setting Paths

--- path.sh ---
# Defining Kaldi root directory
export KALDI_ROOT=
# Setting paths to useful tools
export PATH=
# Enable SRILM
. $KALDI_ROOT/tools/env.sh
# Variable needed for proper data sorting
export LC_ALL=C

--- run.sh ---
# Defining corpus directory
misp2021_corpus=
# Defining path to BeamformIt executable file
beamformit_path=
# Defining path to python interpreter
python_path=
# Directory hosting the coordinate information used to crop mouth ROIs
data_roi=
# Dictionary directory
dict_dir=

  • Running Training

./run.sh # options:
# --stage -1       change the number to start from a different training stage
# --nnet_stage -10 this number controls the TDNN training stages, including preprocessing and postprocessing
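
For example, to skip data preparation and the GMM stages and rerun only the neural-network part (the stage number below is illustrative, not the script's exact value):

./run.sh --stage 12 --nnet_stage -10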

  • Other Tips

Here are the naming rules for the directories produced while training the four TDNN models:

Models          Data for Training           Data for Alignment    Model Directories
Chain-TDNN-A    data/train_far_hires        data/train_far        exp/train_far
Chain-TDNN-A*   data/train_far_sp_hires     data/train_far_sp     exp/train_far_sp
Chain-TDNN-AV   data/train_far_hires_av     data/train_far_av     exp/train_far_av
Chain-TDNN-AV*  data/train_far_sp_hires_av  data/train_far_sp_av  exp/train_far_av_sp

Requirements