This is the baseline system for the Chat-scenario Chinese Lipreading (ChatCLR) Challenge task 1. The task concerns lipreading of predefined wake word(s) using only the visual modality: each sample is labeled ‘1’ if it contains a wake word and ‘0’ otherwise.

For the code and guide of the baseline system, please refer to the video-only system at the GitHub link. For a more detailed description of the baseline system, please refer to the following paper:

Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis
The video system implements a neural network (NN) based approach: a network consisting of CNN layers, an LSTM layer, and fully connected layers is trained to assign labels to the samples.
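As an illustration only, a minimal PyTorch sketch of such a CNN + LSTM + fully connected classifier might look like the following. The `WakeWordNet` name, layer counts, and channel sizes are assumptions for illustration, not the actual baseline configuration; the input is assumed to be the 512-dimensional visual embeddings described next.

```python
import torch
import torch.nn as nn

class WakeWordNet(nn.Module):
    """Illustrative CNN + LSTM + FC classifier for 0/1 wake-word labels.
    All layer sizes are assumptions, not the released baseline config."""

    def __init__(self, feat_dim=512, hidden_dim=128):
        super().__init__()
        # 1-D convolutions over the time axis of the visual embeddings
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(256, hidden_dim, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # single logit: wake word present or not
        )

    def forward(self, x):  # x: (batch, time, feat_dim)
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # conv over time
        _, (h, _) = self.lstm(x)  # take the final hidden state
        return self.fc(h[-1])     # (batch, 1) logit
```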
To get visual embeddings, we first crop the lip region of interest (ROI) from the video streams, then use the lipreading TCN to extract 512-dimensional features. Several optional data augmentation methods are also applied to the video data.
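For illustration, a hypothetical OpenCV helper for the ROI cropping step could look like the sketch below. It assumes lip landmark coordinates are already available from an external face/landmark detector, and the crop and output sizes are illustrative choices rather than the baseline's actual settings.

```python
import cv2

def crop_lip_roi(frame, lip_center, size=96, out_size=88):
    """Crop a square lip ROI around a known landmark center and resize it.
    `lip_center` is assumed to come from an external landmark detector;
    the 96-pixel crop and 88x88 output are illustrative values."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cx, cy = lip_center
    half = size // 2
    # Clamp the crop window to the frame boundaries
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, gray.shape[1]), min(cy + half, gray.shape[0])
    roi = gray[y0:y1, x0:x1]
    return cv2.resize(roi, (out_size, out_size))
```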
We use the Score as the performance criterion for the wake word lipreading task; it is calculated as the sum of the false reject rate (FRR) and the false alarm rate (FAR), as in the sketch after the table. The system performances on the development set are as follows:
| DA  | FAR   | FRR   | Score |
|-----|-------|-------|-------|
| No  | 0.147 | 0.494 | 0.641 |
| Yes | 0.384 | 0.087 | 0.471 |
“DA” indicates data augmentation.
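As a reference for how the criterion is computed, here is a small NumPy sketch. The `wake_word_score` helper is hypothetical, not part of the released code; it simply implements Score = FAR + FRR as stated above.

```python
import numpy as np

def wake_word_score(labels, preds):
    """Score = FAR + FRR for binary wake-word decisions.
    `labels` and `preds` are 0/1 arrays; 1 = wake word present."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    pos = labels == 1
    neg = labels == 0
    frr = float(np.mean(preds[pos] == 0)) if pos.any() else 0.0  # missed wake words
    far = float(np.mean(preds[neg] == 1)) if neg.any() else 0.0  # false alarms
    return far, frr, far + frr

# e.g. far, frr, score = wake_word_score(dev_labels, dev_preds)
```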
--- kws_net_only_video/run.sh ---
# Path to the corpus directory
data_root=
# Path to the python interpreter
python_path=
After setting these variables, run the baseline with:

cd ../kws_net_only_video
sh run.sh
Dependencies:
- numpy
- OpenCV
- tqdm
- sox