Investigating the effect of high-speed video data on audio-visual speech recognition accuracy

Denis Viktorovich Ivanko; Dmitri Aleksandrovich Ryumin; Alexey Anatolyevich Karpov; Milos Zelezny

doi:10.31799/1684-8853-2019-2-26-34

Ivanko Denis, Ph.D. student, ITMO University (Saint Petersburg National Research University of Information Technologies, Mechanics and Optics)
Ryumin Dmitri
Karpov Alexey, head of the Laboratory of Speech and Multimodal Interfaces
Zelezny Milos, vice dean, University of West Bohemia

DOI:

https://doi.org/10.31799/1684-8853-2019-2-26-34

Keywords:

high-speed video camera, audio-visual speech recognition, noisy conditions, visemes, multimodal processing, lipreading

Abstract

Introduction: Modern automatic speech recognition systems are highly accurate in quiet acoustic conditions, reaching
90–95%. However, in noisy, uncontrolled environments, acoustic signals are often distorted, which greatly reduces
recognition accuracy. In such adverse conditions, it is appropriate to use visual information about the speech, since it is not affected by
acoustic noise. At present, there are no studies that objectively show how visual speech recognition accuracy depends on
the video frame rate, and there are no relevant audio-visual databases for model training. Purpose: To improve the reliability and accuracy
of an automatic audio-visual Russian speech recognition system, to collect a representative audio-visual database, and to develop an
experimental setup. Methods: For audio-visual speech recognition, we used coupled hidden Markov model architectures. For the parametric
representation of audio and visual features, we used mel-frequency cepstral coefficients and principal component analysis-based pixel
features. Results: In the experiments, we studied five video frame rates: 25, 50, 100, 150, and 200 fps. The experiments
showed a positive effect from the use of a high-speed video camera: we achieved an absolute increase in accuracy of 1.48% for a bimodal
system and 3.10% for a unimodal one, compared to the standard recording speed of 25 fps. During the experiments, two types of noise
were added to the test data for all speakers: wide-band white noise and "babble noise". The analysis shows that bimodal speech recognition
exceeds unimodal recognition in accuracy, especially at low SNR values (below 15 dB). At very low SNR values (below 5 dB), the acoustic information becomes
non-informative, and the best results are achieved by a unimodal visual speech recognition system. Practical relevance: The use of a
high-speed camera can improve the accuracy and robustness of a continuous audio-visual Russian speech recognition system.
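
The abstract states that white noise was added to the test data at various SNR levels but does not describe the mixing procedure. The following is a minimal illustrative sketch, not the authors' implementation, of one common way to scale Gaussian white noise so that the mixture reaches a requested SNR. The function names, the 440 Hz test tone, and the 16 kHz sampling rate are assumptions made for this example only.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return the signal plus Gaussian white noise scaled to the requested SNR (dB)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # From SNR_dB = 10 * log10(P_signal / P_noise), solve for the target noise power.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Compute the SNR in dB of a noisy signal relative to its clean version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical test tone (assumption): 440 Hz sine, 16 kHz sampling, one second.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_white_noise(clean, snr_db=5.0)
```

Calling `measured_snr_db(clean, noisy)` afterwards is a simple way to confirm that the mixture sits at the intended level. Babble noise would need a recorded noise signal in place of the generated one, which this sketch does not cover.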

Information processing and control

Measuring the effect of high-speed video data on the audio-visual speech recognition accuracy
