This is the baseline system for the Chat-scenario Chinese Lipreading (ChatCLR) Challenge task 1. The task concerns lipreading of predefined wake word(s) using only the visual modality: each sample is labeled ‘1’ if it contains a wake word and ‘0’ otherwise.

For the code and guide of the baseline system, please refer to the video-only system at the GitHub link. For a more detailed description of the baseline system, please refer to the following paper:

Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis
The video system implements a neural network (NN) based approach: a network consisting of CNN layers, an LSTM layer, and fully connected layers is trained to assign labels to the samples.
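As an illustration only, a minimal PyTorch sketch of such a CNN + LSTM + fully connected classifier might look like the following. The `WakeWordNet` name, layer counts, and channel sizes are assumptions for illustration, not the actual baseline configuration; the input is assumed to be the 512-dimensional visual embeddings described next.

```python
import torch
import torch.nn as nn

class WakeWordNet(nn.Module):
    """Illustrative CNN + LSTM + FC classifier for 0/1 wake-word labels.
    All layer sizes are assumptions, not the released baseline config."""

    def __init__(self, feat_dim=512, hidden_dim=128):
        super().__init__()
        # 1-D convolutions over the time axis of the visual embeddings
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(256, hidden_dim, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # single logit: wake word present or not
        )

    def forward(self, x):  # x: (batch, time, feat_dim)
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # conv over time
        _, (h, _) = self.lstm(x)  # take the final hidden state
        return self.fc(h[-1])     # (batch, 1) logit
```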
To get visual embeddings, we first crop the lip region of interest (ROI) from the video streams, then use the lipreading TCN to extract 512-dimensional features. Several optional data augmentation methods are also applied to the video data.
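For illustration, a hypothetical OpenCV helper for the ROI cropping step could look like the sketch below. It assumes lip landmark coordinates are already available from an external face/landmark detector, and the crop and output sizes are illustrative choices rather than the baseline's actual settings.

```python
import cv2

def crop_lip_roi(frame, lip_center, size=96, out_size=88):
    """Crop a square lip ROI around a known landmark center and resize it.
    `lip_center` is assumed to come from an external landmark detector;
    the 96-pixel crop and 88x88 output are illustrative values."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cx, cy = lip_center
    half = size // 2
    # Clamp the crop window to the frame boundaries
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, gray.shape[1]), min(cy + half, gray.shape[0])
    roi = gray[y0:y1, x0:x1]
    return cv2.resize(roi, (out_size, out_size))
```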
We use the Score as the performance criterion for the wake word lipreading task; it is calculated as the sum of the false reject rate (FRR) and the false alarm rate (FAR), as in the sketch after the table. The system performances on the development set are as follows:
| DA  | FAR   | FRR   | Score |
|-----|-------|-------|-------|
| No  | 0.147 | 0.494 | 0.641 |
| Yes | 0.384 | 0.087 | 0.471 |
“DA” indicates data augmentation.
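As a reference for how the criterion is computed, here is a small NumPy sketch. The `wake_word_score` helper is hypothetical, not part of the released code; it simply implements Score = FAR + FRR as stated above.

```python
import numpy as np

def wake_word_score(labels, preds):
    """Score = FAR + FRR for binary wake-word decisions.
    `labels` and `preds` are 0/1 arrays; 1 = wake word present."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    pos = labels == 1
    neg = labels == 0
    frr = float(np.mean(preds[pos] == 0)) if pos.any() else 0.0  # missed wake words
    far = float(np.mean(preds[neg] == 1)) if neg.any() else 0.0  # false alarms
    return far, frr, far + frr

# e.g. far, frr, score = wake_word_score(dev_labels, dev_preds)
```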
--- kws_net_only_video/run.sh ---
# Path to the corpus directory
data_root=
# Path to the python interpreter
python_path=
After setting these variables, run the baseline with:

cd ../kws_net_only_video
sh run.sh
Dependencies:
- numpy
- OpenCV
- tqdm
- sox