Wake Word Lipreading Baseline

For the code and a usage guide for the baseline system, please refer to the video-only system at the Github link

For more description of the baseline system, please refer to the following paper:

Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis

  • Introduction
  • This is the baseline system for Task 1 of the Chat-scenario Chinese Lipreading (ChatCLR) Challenge. The task is to detect predefined wake word(s) from the visual modality only: each sample is labeled ‘1’ if it contains a wake word and ‘0’ otherwise.

  • System Description
  • The video system implements a neural network (NN) based approach: a network consisting of CNN layers, an LSTM layer, and fully connected layers is trained to assign labels to the samples.
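The CNN-LSTM-FC architecture described above can be sketched as follows. This is a minimal illustration, not the released code; the layer sizes and kernel widths are hypothetical, and only the input feature dimension (512, from the lipreading front-end) comes from this document.

```python
import torch
import torch.nn as nn

class VideoKWSNet(nn.Module):
    """Sketch of the baseline classifier: CNN layers over the frame
    sequence, an LSTM layer, then fully connected layers producing a
    binary wake-word decision. Layer sizes are illustrative."""

    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        # 1-D convolutions over the time axis of per-frame visual embeddings
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, hidden, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):                # x: (batch, time, feat_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, 128)
        _, (hn, _) = self.lstm(h)        # final hidden state summarizes the clip
        return self.fc(hn[-1])           # (batch, 2) logits for labels 0 / 1

net = VideoKWSNet()
logits = net(torch.randn(4, 30, 512))    # 4 clips of 30 frames each
```

The LSTM's final hidden state is used as a clip-level summary, so variable-length clips map to a fixed-size vector before the fully connected classifier.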

  • Video Wake Word Spotting
  • To obtain visual embeddings, we first crop the lip region of interest (ROI) from the video stream, then use the lipreading TCN to extract 512-dimensional features. Several optional data augmentation methods are also applied to the video data.
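The lip-ROI cropping step can be sketched as below. This is a hypothetical helper, not the baseline code: in a real pipeline the mouth center would come from a face/landmark detector, and the 96x96 crop size is an assumption.

```python
import numpy as np

def crop_lip_roi(frame, mouth_center, size=96):
    """Crop a square lip ROI centered on the mouth (hypothetical helper;
    a real pipeline derives mouth_center from facial landmarks).
    Zero-pads where the box extends past the frame border."""
    h, w = frame.shape[:2]
    half = size // 2
    cy, cx = mouth_center
    roi = np.zeros((size, size) + frame.shape[2:], dtype=frame.dtype)
    # clip the crop box to the frame, then paste into the padded ROI
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    roi[y0 - (cy - half):y0 - (cy - half) + (y1 - y0),
        x0 - (cx - half):x0 - (cx - half) + (x1 - x0)] = frame[y0:y1, x0:x1]
    return roi

frame = (np.arange(120 * 160 * 3, dtype=np.uint32) % 255).astype(np.uint8).reshape(120, 160, 3)
roi = crop_lip_roi(frame, mouth_center=(90, 80))   # 96x96 patch around the lips
```

The resulting fixed-size lip crops are what the lipreading front-end consumes to produce the 512-dimensional per-frame features mentioned above.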

Results

We use the Score as the performance criterion for the wake word lipreading task; it is the sum of the false reject rate (FRR) and the false alarm rate (FAR). The system performances on the development set are as follows:

DA    FAR     FRR     Score
No    0.147   0.494   0.641
Yes   0.384   0.087   0.471

“DA” indicates data augmentation.
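The Score above (Score = FRR + FAR) can be computed as in this small sketch; the function name and the toy labels are illustrative, not from the challenge toolkit.

```python
def wws_score(labels, preds):
    """Score = FRR + FAR for the wake word task.
    labels/preds are 0/1 sequences; 1 = wake word present."""
    pos = [p for l, p in zip(labels, preds) if l == 1]
    neg = [p for l, p in zip(labels, preds) if l == 0]
    frr = sum(1 for p in pos if p == 0) / len(pos)  # missed wake words
    far = sum(1 for p in neg if p == 1) / len(neg)  # false alarms
    return frr + far

labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds  = [1, 1, 0, 1, 0, 1, 0, 0]
score = wws_score(labels, preds)   # FRR = 0.25, FAR = 0.25 -> Score = 0.5
```

Because the two error rates are simply summed, a lower Score is better, and a system can trade false alarms against false rejects (as the two rows of the table illustrate) while being compared on a single number.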

Setting Paths

  • kws_net_only_video

--- kws_net_only_video/run.sh ---
# Defining corpus directory
data_root=
# Defining path to python interpreter
python_path=

Running the Baseline System

  • Run Video Training

cd ../kws_net_only_video
sh run.sh

Requirements

  • pytorch
  • python packages: numpy, OpenCV, tqdm
  • other tools: sox