Biased Decoding for Whisper speech-to-text. The goal is to recognize unusual names or phrases with smaller Whisper models and lower beam sizes.
This repository implements a custom LogitsProcessor for HuggingFace's implementation of the OpenAI Whisper model; the processor uses KenLM for word probabilities.
Use script/setup or pip3 install -e . to install.
If you want to run the tests and linting scripts, use script/setup --dev or pip3 install -e .[dev] instead.
Run python3 -m whisper_bidec --text <text_file> <wav_file> [<wav_file>...] to transcribe WAV files and get CSV output like:
path_to_1.wav|text 1 without bias|text 1 with bias
path_to_2.wav|text 2 without bias|text 2 with bias
...
The text file should contain a list of sentences that you want to bias Whisper towards. These need to have the correct casing and punctuation. You can add multiple --text files.
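For example, a sentences file might contain lines like the following (these sentences are only illustrative):

Turn on the light in the living room.
Set a timer for five minutes.
Play music by Aphex Twin.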
Increase --bias-towards-lm to get transcripts more like the example sentences (default: 0.5).
Increase --unk-logprob (default: -5, must be less than 0) to allow more words outside of the example sentences, or decrease it (e.g., to -10) to restrict transcripts to words from the example sentences.
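For example, to bias strongly towards the example sentences while heavily penalizing unknown words, you might run python3 -m whisper_bidec --text sentences.txt --bias-towards-lm 0.8 --unk-logprob -10 audio.wav (sentences.txt and audio.wav are placeholder file names).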
To bias Whisper, an n-gram language model is trained on the example sentences. Because Whisper uses tokens internally (partial and whole words), the n-gram model is trained on the tokenized sentences. By default, this is a 5th order n-gram model (change with --ngram-order) with Witten-Bell smoothing.
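As a rough illustration of the tokenization step (not the repository's actual training code, and the checkpoint name is only an example), the example sentences can be turned into Whisper token strings like this:

```python
from transformers import WhisperTokenizer

# Use the same tokenizer as the Whisper model being biased
# ("openai/whisper-tiny" is only an example checkpoint).
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

sentences = ["Turn on the light in the living room.", "Set a timer for five minutes."]

# Each sentence becomes a sequence of token strings (partial and whole words);
# sequences like these are what the n-gram model is trained on.
tokenized = [
    tokenizer.convert_ids_to_tokens(tokenizer.encode(s, add_special_tokens=False))
    for s in sentences
]
print(tokenized[0])  # e.g. ['Turn', 'Ġon', 'Ġthe', 'Ġlight', ...]
```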
During Whisper's beam search, the scores for the next tokens are computed using both the Whisper model's and the language model's probabilities. Setting --bias-towards-lm 0.5 (the default) means that the probabilities from the language model will be halved before being added to Whisper's probabilities.
When computing the language model probabilities, tokens outside of the example sentences are replaced with an "unknown" token (<unk>) whose (log) probability is set by --unk-logprob. A lower value will penalize sentences that use "unknown" tokens, while a value closer to 0 will allow them.
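A minimal sketch of what this combination could look like as a HuggingFace LogitsProcessor, assuming log-space mixing, the kenlm Python bindings, and an illustrative vocab mapping from Whisper token ids to LM token strings (this is not the repository's actual implementation):

```python
import kenlm
import torch
from transformers import LogitsProcessor


class BiasedLogitsProcessor(LogitsProcessor):
    """Sketch: add scaled KenLM log probabilities to Whisper's next-token scores.

    `lm` is an n-gram model trained on the tokenized example sentences, and
    `vocab` maps Whisper token ids to the token strings the LM was trained on.
    Both names, and this class itself, are illustrative only.
    """

    def __init__(self, lm: kenlm.Model, vocab: dict, bias: float = 0.5, unk_logprob: float = -5.0):
        self.lm = lm
        self.vocab = vocab
        self.bias = bias                # --bias-towards-lm
        self.unk_logprob = unk_logprob  # --unk-logprob

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        for beam in range(scores.shape[0]):
            # Walk the KenLM state through the tokens decoded so far,
            # mapping anything outside the example sentences to <unk>.
            state = kenlm.State()
            self.lm.BeginSentenceWrite(state)
            for token_id in input_ids[beam].tolist():
                next_state = kenlm.State()
                self.lm.BaseScore(state, self.vocab.get(token_id, "<unk>"), next_state)
                state = next_state

            # Add the scaled LM log probability of every candidate next token;
            # with the default bias of 0.5 the LM contribution is halved.
            for token_id in range(scores.shape[1]):
                token = self.vocab.get(token_id)
                if token is None:
                    # Token never appears in the example sentences: fixed <unk> penalty.
                    lm_logprob = self.unk_logprob
                else:
                    out_state = kenlm.State()
                    lm_logprob = self.lm.BaseScore(state, token, out_state)
                scores[beam, token_id] += self.bias * lm_logprob
        return scores
```

A real implementation would cache LM states across decoding steps and avoid looping over the entire vocabulary in Python; the loops above are only for clarity.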
Some thoughts on the method:
- Witten-Bell smoothing seems to work best, but there may be better methods. Additionally, its implementation was mostly written by ChatGPT, so it hasn't been fully verified.
- Casing is a problem right now since Whisper wants to output case-sensitive sentences. It may be worth treating tokens that differ only in case as identical in the language model.