The impact of the Lombard effect on audio and visual speech recognition systems

Ricard Marxer; Jon Barker; Najwa Alghamdi; Steve Maddock

doi:10.1016/j.specom.2018.04.006

Journal article, Speech Communication, Year: 2018

Ricard Marxer

Role: Author
PersonId: 19391
IdHAL: ricard-marxer
ORCID: 0000-0001-5099-5059
IdRef: 240437713

University of Sheffield [Sheffield]

Laboratoire d'Informatique et des Systèmes (LIS) (Marseille, Toulon)

DYNamiques de l’Information

Jon Barker

Role: Author

University of Sheffield [Sheffield]

Najwa Alghamdi

Role: Author

University of Sheffield [Sheffield]

Steve Maddock

Role: Author

University of Sheffield [Sheffield]

Abstract

When producing speech in noisy backgrounds, talkers reflexively adapt their speaking style in ways that increase speech-in-noise intelligibility. This adaptation, known as the Lombard effect, is likely to have an adverse effect on the performance of automatic speech recognition systems that have not been designed to anticipate it. However, previous studies of this impact have used very small amounts of data and recognition systems that lack modern adaptation strategies. This paper aims to rectify this by using a new audiovisual Lombard corpus containing speech from 54 different speakers (significantly larger than any previously available) and modern state-of-the-art speech recognition techniques. The paper is organised as three speech-in-noise recognition studies. The first examines the case in which a system is presented with Lombard speech having been exclusively trained on normal speech. It was found that the Lombard mismatch caused a significant decrease in performance even if the level of the Lombard speech was normalised to match the level of normal speech. However, the size of the mismatch was highly speaker-dependent, thus explaining conflicting results presented in previous smaller studies. The second study compares systems trained in matched conditions (i.e., training and testing with the same speaking style). Here the Lombard speech affords a large increase in recognition performance. Part of this is due to the greater energy leading to a reduction in noise masking, but performance improvements persist even after the effect of the signal-to-noise level difference is compensated for. An analysis across speakers shows that the Lombard speech energy is spectro-temporally distributed in a way that reduces energetic masking, and this reduction in masking is associated with an increase in recognition performance. The final study repeats the first two using a recognition system trained on visual speech.
In the visual domain, performance differences are not confounded by differences in noise masking. It was found that in matched conditions Lombard speech supports better recognition performance than normal speech. The benefit was consistently present across all speakers but to a varying degree. Surprisingly, the Lombard benefit was observed to a small degree even when training on mismatched non-Lombard visual speech, i.e., the increased clarity of the Lombard speech outweighed the impact of the mismatch. The paper presents two generally applicable conclusions: i) systems that are designed to operate in noise will benefit from being trained on well-matched Lombard speech data; ii) the results of speech recognition evaluations that employ artificial speech and noise mixing need to be treated with caution: they are overly optimistic to the extent that they ignore a significant source of mismatch, but at the same time overly pessimistic in that they do not anticipate the potential increased intelligibility of the Lombard speaking style.
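
The abstract mentions two signal-level controls: normalising the level of Lombard speech to match normal speech, and compensating for the resulting signal-to-noise difference. The following is a minimal sketch of what such controls could look like, assuming a simple global RMS level measure and synthetic Gaussian signals as stand-ins for real recordings. The function names (`rms`, `match_level`, `snr_db`, `scale_noise_to_snr`) are hypothetical and this is not the authors' actual pipeline, which the abstract does not describe.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def match_level(signal, reference):
    """Scale `signal` so that its RMS level equals that of `reference`."""
    gain = rms(reference) / rms(signal)
    return [gain * v for v in signal]


def snr_db(speech, noise):
    """Global signal-to-noise ratio in decibels (assumed RMS-based)."""
    return 20.0 * math.log10(rms(speech) / rms(noise))


def scale_noise_to_snr(speech, noise, target_snr_db):
    """Scale `noise` so that speech vs. noise has the requested global SNR."""
    gain = rms(speech) / (rms(noise) * 10.0 ** (target_snr_db / 20.0))
    return [gain * v for v in noise]


# Synthetic stand-ins for recordings (assumption: real data would be
# loaded from audio files; the amplitudes here are arbitrary).
random.seed(0)
n = 16000
normal = [0.1 * random.gauss(0, 1) for _ in range(n)]
lombard = [0.3 * random.gauss(0, 1) for _ in range(n)]  # louder "Lombard" signal
noise = [random.gauss(0, 1) for _ in range(n)]

# Level normalisation: bring the Lombard signal to the normal-speech level.
lombard_normalised = match_level(lombard, normal)

# SNR control: mix at a fixed global SNR regardless of the original level.
scaled_noise = scale_noise_to_snr(lombard_normalised, noise, target_snr_db=-3.0)
mixture = [s + v for s, v in zip(lombard_normalised, scaled_noise)]
```

A global RMS measure is the simplest choice for a sketch; it ignores pauses and the spectro-temporal distribution of energy, which the abstract identifies as a separate source of the Lombard benefit.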

Keywords

Automatic speech recognition; Multimodal speech; Lombard speech; Intelligibility; Robust speech processing; Visual speech

Domains

Signal and Image Processing [eess.SP]; Computation and Language [cs.CL]

Main file

1-s2.0-S0167639317302674-main.pdf (994.9 KB)

Origin: Publication funded by an institution

https://hal.science/hal-01779704

Submitted on: Sunday, 13 May 2018, 23:48:04

Last modified on: Friday, 22 March 2024, 18:24:03

Long-term archiving on: Monday, 24 September 2018, 16:11:19

Dates and versions

hal-01779704 , version 1 (13-05-2018)

License

Attribution

Identifiers

HAL Id : hal-01779704 , version 1
DOI : 10.1016/j.specom.2018.04.006

Cite

Ricard Marxer, Jon Barker, Najwa Alghamdi, Steve Maddock. The impact of the Lombard effect on audio and visual speech recognition systems. Speech Communication, 2018, 100, pp.58-68. ⟨10.1016/j.specom.2018.04.006⟩. ⟨hal-01779704⟩

Collections

UNIV-TLN CNRS UNIV-AMU LIS-LAB
