Task 2 NN-HMM Based AVSR Baseline

For details, please refer to the GitHub repository.

  • Data preparation

    • speech enhancement

    We provide two baseline speech enhancement front-ends, Weighted Prediction Error (WPE) dereverberation and weighted delay-and-sum (DAS) beamforming, to reduce the reverberation and noise in the speech signals. The two algorithms are implemented with the open-source toolkits nara_wpe and BeamformIt, respectively.
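
    As an illustration, the sketch below applies nara_wpe's offline WPE to a multi-channel recording; the input/output file names and STFT settings are assumptions, not the recipe's exact values, and BeamformIt is driven separately through its own configuration files.

python3 - <<'EOF'
# Minimal WPE dereverberation sketch with nara_wpe; file names are illustrative.
import soundfile as sf
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

y, sr = sf.read('far_6ch.wav')        # hypothetical multi-channel input
y = y.T                               # -> (channels, samples)
Y = stft(y, size=512, shift=128)      # -> (channels, frames, freq_bins)
Z = wpe(Y.transpose(2, 0, 1), taps=10, delay=3, iterations=3)
z = istft(Z.transpose(1, 2, 0), size=512, shift=128)
sf.write('far_6ch_wpe.wav', z.T, sr)  # dereverberated channels
EOF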

    • prepare data and language directory for kaldi

    For the training, development, and test sets, we prepare the data directories and the lexicon in the format expected by Kaldi. Note that we take the raw DaCiDian resource and convert it into the Kaldi lexicon format.
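
    A minimal sketch of this step with standard Kaldi utilities is shown below; the directory names and the OOV symbol "<UNK>" are assumptions following common Kaldi conventions.

# Sketch of the data/lang preparation; directory names are illustrative.
# Each set gets wav.scp, text, utt2spk (and segments) under data/<set>, e.g.:
utils/utt2spk_to_spk2utt.pl data/train_far/utt2spk > data/train_far/spk2utt
utils/validate_data_dir.sh --no-feats data/train_far
# Build the lang directory from the DaCiDian-derived lexicon in data/local/dict
# (lexicon.txt, silence_phones.txt, nonsilence_phones.txt, optional_silence.txt):
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang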

  • Language model

    We segment the MISP speech transcriptions for language model training, using DaCiDian as the dictionary and the open-source Jieba toolkit as the segmenter. For the language model, we choose a maximum entropy-based 3-gram model, which achieves the best perplexity among the n-gram (n = 2, 3, 4) models trained on the MISP speech transcripts with different smoothing algorithms and parameter sets. The selected 3-gram model has 516,600 unigrams, 432,247 bigrams, and 915,962 trigrams. Note that the temporary and final language models are stored in /data/srilm.
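
    As a rough sketch of this pipeline, the commands below segment the transcripts with Jieba's command-line mode and train/score one candidate 3-gram with SRILM; the file names are placeholders, only one smoothing setting is shown, and the selected maximum-entropy model is trained analogously with SRILM's maxent support.

# Word segmentation with Jieba's command-line interface (space as delimiter):
python -m jieba -d ' ' text.raw > text.seg
# Train one candidate 3-gram (here interpolated Kneser-Ney) and check its
# perplexity on the development transcripts; repeat over n and smoothing
# settings to pick the best model:
ngram-count -order 3 -text text.seg -lm srilm/3gram.kn.gz -kndiscount -interpolate
ngram -order 3 -lm srilm/3gram.kn.gz -ppl dev.seg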

  • Acoustic model

    The acoustic model of the ASR system is built largely following the Kaldi CHiME-6 recipe, which mainly contains two stages: GMM-HMM modeling and TDNN deep learning modeling.

    • GMM-HMM

      For feature extraction, we extract 13-dimensional MFCC features plus 3-dimensional pitch features. As a starting point for the triphone models, a monophone model is trained on a subset of 50k utterances. Then a small triphone model and a larger triphone model are trained consecutively using delta features, on a subset of 100k utterances and on the whole dataset, respectively. In the third triphone training pass, an MLLT-based global transform is estimated iteratively on top of LDA features to obtain speaker-independent features. For the fourth triphone model, feature space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT) is applied, as sketched below.
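
      The four passes condense into the standard Kaldi commands below; the subset names, leaf/Gaussian counts, and experiment directories are illustrative assumptions rather than the recipe's exact settings.

# Sketch of the GMM-HMM pipeline; directory names and model sizes are illustrative.
steps/make_mfcc_pitch.sh --nj 20 data/train_far exp/make_mfcc/train_far mfcc
steps/compute_cmvn_stats.sh data/train_far
utils/subset_data_dir.sh data/train_far 50000 data/train_50k
utils/subset_data_dir.sh data/train_far 100000 data/train_100k
# Monophone start, then two delta-feature triphone passes:
steps/train_mono.sh --nj 20 data/train_50k data/lang exp/mono
steps/align_si.sh --nj 20 data/train_100k data/lang exp/mono exp/mono_ali
steps/train_deltas.sh 2500 30000 data/train_100k data/lang exp/mono_ali exp/tri1
steps/align_si.sh --nj 20 data/train_far data/lang exp/tri1 exp/tri1_ali
steps/train_deltas.sh 4000 60000 data/train_far data/lang exp/tri1_ali exp/tri2
# Third pass: LDA features with an iteratively re-estimated global MLLT transform:
steps/align_si.sh --nj 20 data/train_far data/lang exp/tri2 exp/tri2_ali
steps/train_lda_mllt.sh 5000 90000 data/train_far data/lang exp/tri2_ali exp/tri3
# Fourth pass: speaker adaptive training with fMLLR transforms:
steps/align_fmllr.sh --nj 20 data/train_far data/lang exp/tri3 exp/tri3_ali
steps/train_sat.sh 5000 100000 data/train_far data/lang exp/tri3_ali exp/tri4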

    • NN-HMM

      Based on the tied-triphone state alignments from the GMM, a TDNN is configured and trained to replace the GMM. Two signal-level data augmentation techniques, speed perturbation and volume perturbation, are applied. The input features are 40-dimensional high-resolution MFCC features with cepstral mean normalization. Note that a 100-dimensional i-vector is also attached to each frame-wise input; its extractor is trained on the augmented corpus. An advanced time-delay neural network (TDNN) baseline using lattice-free maximum mutual information (LF-MMI) training and other strategies is adopted in the system; consult the paper and the documentation for more details.
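
      The input preparation for this stage can be sketched with standard Kaldi utilities as below; the directory names are illustrative, and the LF-MMI chain training itself is launched by the recipe's TDNN script (see the repository).

# Sketch of the chain-TDNN input preparation; directory names are illustrative.
# 3-way speed perturbation (0.9/1.0/1.1), then volume perturbation in place:
utils/data/perturb_data_dir_speed_3way.sh data/train_far data/train_far_sp
utils/copy_data_dir.sh data/train_far_sp data/train_far_sp_hires
utils/data/perturb_data_dir_volume.sh data/train_far_sp_hires
# 40-dimensional high-resolution MFCCs:
steps/make_mfcc.sh --nj 20 --mfcc-config conf/mfcc_hires.conf data/train_far_sp_hires
steps/compute_cmvn_stats.sh data/train_far_sp_hires
# 100-dimensional online i-vectors, given an extractor trained on this corpus:
steps/online/nnet2/extract_ivectors_online.sh --nj 20 \
  data/train_far_sp_hires exp/nnet3/extractor exp/nnet3/ivectors_train_far_sp_hires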

  • Audio-Visual Speech Recognition

    Based on the NN-HMM hybrid ASR system described above, we try to enhance the system with lip reading. To obtain visual embeddings, we first crop mouth ROIs from the video streams, then use the lip-reading TCN to extract 512-dimensional features. Still using the TDNN, we simply concatenate the embeddings of the two modalities (512 + 40 + 100) at the training stage, as sketched below. Preliminary results show that the visual information does help ASR, and there is still considerable room for improvement.
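
    To make the fusion concrete, here is a minimal sketch with Kaldi's paste-feats; the archive names are assumptions, and the i-vectors are appended by the chain training scripts as in the audio-only system.

# Append the 512-dim visual embeddings to the 40-dim hires MFCCs frame by frame.
# Archive names are illustrative; the two streams must share the same frame rate
# (small length mismatches are absorbed by --length-tolerance).
paste-feats --length-tolerance=2 \
  ark:mfcc_hires.ark ark:visual_tcn_512.ark \
  ark,scp:av_feats.ark,av_feats.scp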

Quick start

  • Setting Local System Jobs

# Setting local system jobs (local CPU - no external clusters)
export train_cmd=run.pl
export decode_cmd=run.pl

  • Setting Paths

--- path.sh ---
# Defining Kaldi root directory
export KALDI_ROOT=
# Setting paths to useful tools
export PATH=
# Enable SRILM
. $KALDI_ROOT/tools/env.sh
# Variable needed for proper data sorting
export LC_ALL=C

--- run.sh ---
# Defining corpus directory
misp2021_corpus=
# Defining path to BeamformIt executable file
beamformit_path=
# Defining path to python interpreter
python_path=
# Directory hosting the coordinate information used to crop mouth ROIs
data_roi=
# Dictionary directory
dict_dir=

  • Running Training

./run.sh # options:
# --stage -1       change the number to start from a different training stage
# --nnet_stage -10 this number controls the TDNN training stages, including preprocessing and postprocessing
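
For example, to skip data preparation and the GMM stages and rerun only the neural-network part (the stage number below is illustrative, not the script's exact value):

./run.sh --stage 12 --nnet_stage -10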

  • Other Tips

Here are the naming rules for the directories produced while training the four TDNN models:

Models          Data for Training           Data for Alignment    Model Directories
Chain-TDNN-A    data/train_far_hires        data/train_far        exp/train_far
Chain-TDNN-A*   data/train_far_sp_hires     data/train_far_sp     exp/train_far_sp
Chain-TDNN-AV   data/train_far_hires_av     data/train_far_av     exp/train_far_av
Chain-TDNN-AV*  data/train_far_sp_hires_av  data/train_far_sp_av  exp/train_far_av_sp

Requirements