The impact of the Lombard effect on audio and visual speech recognition systems

Abstract: When producing speech in noisy backgrounds, talkers reflexively adapt their speaking style in ways that increase speech-in-noise intelligibility. This adaptation, known as the Lombard effect, is likely to have an adverse effect on the performance of automatic speech recognition systems that have not been designed to anticipate it. However, previous studies of this impact have used very small amounts of data and recognition systems that lack modern adaptation strategies. This paper aims to rectify this by using a new audiovisual Lombard corpus containing speech from 54 different speakers (significantly larger than any previously available) and modern state-of-the-art speech recognition techniques. The paper is organised as three speech-in-noise recognition studies. The first examines the case in which a system is presented with Lombard speech having been exclusively trained on normal speech. It was found that the Lombard mismatch caused a significant decrease in performance even if the level of the Lombard speech was normalised to match the level of normal speech. However, the size of the mismatch was highly speaker-dependent, thus explaining conflicting results presented in previous smaller studies. The second study compares systems trained in matched conditions (i.e., training and testing with the same speaking style). Here the Lombard speech affords a large increase in recognition performance. Part of this is due to the greater energy leading to a reduction in noise masking, but performance improvements persist even after the effect of signal-to-noise level difference is compensated. An analysis across speakers shows that the Lombard speech energy is spectro-temporally distributed in a way that reduces energetic masking, and this reduction in masking is associated with an increase in recognition performance. The final study repeats the first two using a recognition system trained on visual speech.
In the visual domain, performance differences are not confounded by differences in noise masking. It was found that in matched conditions Lombard speech supports better recognition performance than normal speech. The benefit was consistently present across all speakers but to a varying degree. Surprisingly, the Lombard benefit was observed to a small degree even when training on mismatched non-Lombard visual speech, i.e., the increased clarity of the Lombard speech outweighed the impact of the mismatch. The paper presents two generally applicable conclusions: i) systems that are designed to operate in noise will benefit from being trained on well-matched Lombard speech data, ii) the results of speech recognition evaluations that employ artificial speech and noise mixing need to be treated with caution: they are overly optimistic to the extent that they ignore a significant source of mismatch but at the same time overly pessimistic in that they do not anticipate the potential increased intelligibility of the Lombard speaking style.
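
The level normalisation and artificial speech-and-noise mixing mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and toy signals are hypothetical, and measuring level as plain RMS is an assumption (the paper may use a different level measure).

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def normalise_level(signal, reference):
    """Scale `signal` so that its RMS level matches that of `reference`."""
    gain = rms(reference) / rms(signal)
    return [gain * x for x in signal]


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` to give the requested speech-to-noise ratio, then add it.

    Returns the mixture and the scaled noise.
    """
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measure_snr_db(speech, noise):
    """Speech-to-noise ratio in dB, computed from RMS levels."""
    return 20 * math.log10(rms(speech) / rms(noise))


# Toy stand-ins for real recordings (hypothetical data, assumed 16 kHz).
rng = random.Random(0)
n_samples = 16000
normal = [0.1 * math.sin(2 * math.pi * 200 * t / 16000) for t in range(n_samples)]
lombard = [0.3 * math.sin(2 * math.pi * 230 * t / 16000) for t in range(n_samples)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

# Remove the overall level difference between the two speaking styles.
lombard_normalised = normalise_level(lombard, normal)

# Mix each style with noise at the same nominal SNR.
normal_mix, normal_noise = mix_at_snr(normal, noise, snr_db=-3.0)
lombard_mix, lombard_noise = mix_at_snr(lombard_normalised, noise, snr_db=-3.0)
```

After normalisation both styles are mixed at the same global SNR, so any remaining difference in recognition performance cannot be attributed to overall level alone. Mixing recordings of normal speech with noise in this artificial way is the evaluation practice that the abstract's second conclusion advises treating with caution.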
Document type:
Journal article
Speech Communication, Elsevier : North-Holland, 2018, 100, pp.58-68. 〈10.1016/j.specom.2018.04.006〉
Cited literature: 54 references

https://hal.archives-ouvertes.fr/hal-01779704
Contributor: Ricard Marxer
Submitted on: Sunday, 13 May 2018 - 23:48:04
Last modified on: Tuesday, 15 May 2018 - 19:04:54

File

1-s2.0-S0167639317302674-main....
Publication funded by an institution

License

Distributed under a Creative Commons Attribution 4.0 International License

Citation

Ricard Marxer, Jon Barker, Najwa Alghamdi, Steve Maddock. The impact of the Lombard effect on audio and visual speech recognition systems. Speech Communication, Elsevier : North-Holland, 2018, 100, pp.58-68. 〈10.1016/j.specom.2018.04.006〉. 〈hal-01779704〉
