arxiv:2501.11378

Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio

Published on Jan 20, 2025

Authors:

Abstract

Whisper ASR model hallucinations caused by non-speech audio segments can be mitigated through post-processing techniques that reduce word error rate.

Hallucinations of deep neural models are among the key challenges in automatic speech recognition (ASR). In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that appear frequently. We then study hallucinations caused by augmenting speech with such sounds. Finally, we describe the creation of a bag of hallucinations (BoH) that makes it possible to remove the effect of hallucinations by post-processing text transcriptions. The results of our experiments show that such post-processing can reduce word error rate (WER) and acts as a good safeguard against problematic hallucinations.
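The abstract does not spell out how the BoH is built or applied, so the following is only a minimal sketch of the general idea, assuming the BoH is a list of known hallucinated phrases that are removed from a transcript by exact phrase matching after text normalization. The phrase list, the function names (`normalize`, `remove_hallucinations`, `wer`), and the matching strategy are all illustrative assumptions, not the authors' implementation.

```python
import re

# Hypothetical bag of hallucinations (BoH). These phrases are placeholders
# chosen for illustration; they are NOT the list from the paper.
BAG_OF_HALLUCINATIONS = [
    "thanks for watching",
    "thank you for watching",
    "subscribe to my channel",
]


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


def remove_hallucinations(transcript: str, bag=BAG_OF_HALLUCINATIONS) -> str:
    """Delete every BoH phrase from a normalized transcript (assumed strategy)."""
    cleaned = normalize(transcript)
    # Try longer phrases first so a short entry cannot split a longer one.
    for phrase in sorted((normalize(p) for p in bag), key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        cleaned = re.sub(pattern, " ", cleaned)
    return " ".join(cleaned.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


reference = "turn left at the next junction"
hypothesis = "Turn left at the next junction. Thanks for watching!"

print(wer(reference, hypothesis))                         # 0.5
print(wer(reference, remove_hallucinations(hypothesis)))  # 0.0
```

In this toy case the three hallucinated words count as insertions against a six-word reference, so removing them lowers WER from 0.5 to 0.0. Exact phrase removal would also delete a BoH phrase that was genuinely spoken, so a practical filter would likely need additional safeguards.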


