Measuring the effect of high-speed video data on the audio-visual speech recognition accuracy
Keywords:
high-speed video camera, audio-visual speech recognition, noisy conditions, visemes, multimodal processing, lipreading
Abstract
Introduction: The accuracy of modern automatic speech recognition systems in quiet acoustic conditions is high, reaching
90–95%. However, in noisy, uncontrolled environments, acoustic signals are often distorted, which greatly reduces
recognition accuracy. In such adverse conditions, it is reasonable to use visual speech information, as it is not affected by
acoustic noise. At present, no studies objectively characterize the dependence of visual speech recognition accuracy on
the video frame rate, and no suitable audio-visual databases exist for model training. Purpose: To improve the reliability and accuracy
of an automatic audio-visual Russian speech recognition system, to collect a representative audio-visual database, and to develop an
experimental setup. Methods: For audio-visual speech recognition, we used coupled hidden Markov model architectures. For parametric
representation of audio and visual features, we used mel-frequency cepstral coefficients and principal component analysis-based pixel
features. Results: In the experiments, we studied five video frame rates: 25, 50, 100, 150, and 200 fps. The experiments
showed a positive effect from the use of a high-speed video camera: we achieved an absolute increase in accuracy of 1.48% for a bimodal
system and 3.10% for a unimodal one, compared to the standard recording speed of 25 fps. During the experiments, two types of noise
were added to the test data of all speakers: wide-band white noise and “babble noise”. The analysis shows that bimodal speech recognition
exceeds unimodal recognition in accuracy, especially at low SNR values (<15 dB). At very low SNR values (<5 dB), the acoustic information becomes
uninformative, and the best results are achieved by a unimodal visual speech recognition system. Practical relevance: The use of a
high-speed camera can improve the accuracy and robustness of a continuous audio-visual Russian speech recognition system.
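
The noisy test conditions described above depend on mixing noise into clean speech at a controlled SNR. The abstract does not say how this mixing was performed, so the following is only a minimal sketch of one common approach: scale the noise so that the ratio of signal power to noise power matches a target value in dB. The function names `add_noise_at_snr` and `measured_snr_db`, the 16 kHz sampling rate, and the synthetic sine "speech" signal are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def add_noise_at_snr(signal, noise, snr_db):
    """Return signal + scaled noise so that the mixture has the requested SNR (dB).

    Hypothetical helper: the paper does not specify its mixing procedure.
    """
    signal = np.asarray(signal, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat or trim the noise so it covers the whole signal.
    reps = int(np.ceil(len(signal) / len(noise)))
    noise = np.tile(noise, reps)[: len(signal)]

    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)

    # SNR_dB = 10 * log10(P_signal / (gain^2 * P_noise)), solved for gain.
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise


def measured_snr_db(clean, noisy):
    """Compute the SNR (dB) of a mixture, given the clean reference signal."""
    residual = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Illustrative data only: a sine tone stands in for speech (assumed 16 kHz).
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2.0 * np.pi * 220.0 * t)
white = rng.standard_normal(sample_rate)  # wide-band white noise

noisy_15db = add_noise_at_snr(clean, white, 15.0)
noisy_5db = add_noise_at_snr(clean, white, 5.0)
```

The same function could be applied to a recorded babble-noise segment in place of `white`. Because the gain is computed from average power over the whole utterance, the SNR is a global figure; a per-frame or speech-active-only definition would give different scaling.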