
Entropy-Based Methods for Word-Level ASR Confidence Estimation


Once you have your automatic speech recognition (ASR) model predictions, you may also want to know how likely those predictions are to be correct. This probability of correctness, or confidence, is often measured as a raw prediction probability (fast, simple, and likely useless). You can also train a separate model to estimate the prediction confidence (accurate, but complex and slow). This post explains how to achieve fast, simple word-level ASR confidence estimation using entropy-based methods.

Overview of confidence estimation

Have you ever seen a machine learning model prediction and wondered how accurate that prediction is? You can make a guess based on the accuracy measured on a similar test case. For example, suppose you know that your ASR model recognizes words from recorded speech with a Word Error Rate (WER) of 10%. In that case, you can expect each word the model recognizes to be correct about 90% of the time.

Such a rough estimate may be enough for some applications, but what if you want to know which specific words are more likely to be correct and which are not? This requires using prediction information beyond the words themselves, such as the exact prediction probabilities returned by the model.

Using raw prediction probabilities as a measure of confidence is both the fastest way to tell which predictions are more and less likely to be correct and the simplest to implement.

But are raw probabilities useful as confidence estimates? Not really. Figure 1, for example, shows confidence computed for greedy search recognition results: frame by frame for the linguistic units with the highest probability (the “maximum probability”), then aggregated into word-level scores. What you see there is called “overconfidence,” a property of a model whose prediction probability distribution is heavily skewed towards the best hypothesis.

In other words, the model almost always gives a probability close to one for one possible prediction and zero for any other. Thus, even if the prediction is incorrect, its probability will often be greater than 0.9. Overconfidence has nothing to do with the model’s architecture or its components: simple convolutional or recurrent neural networks can have the same overconfidence degree as transformer- or conformer-like models. 

Overconfidence comes from the loss functions used to train end-to-end ASR models. These losses reach their minimum when the target prediction’s probability is maximized and all other predictions have zero probability. They include cross-entropy, the ubiquitous connectionist temporal classification (CTC) and recurrent neural network transducer (RNN-T) losses, and any other loss from the maximum-likelihood family.

Overconfidence makes prediction probabilities unnatural. It is difficult to set the correct threshold separating correct and incorrect predictions, which makes using raw probabilities as confidence nearly useless.

Figure 1. Stacked log histograms of correctly and incorrectly recognized words against their confidence scores. Subfigures (a) and (d) are obtained with the mean of the predictions of frames belonging to the same word, (b) and (e) with the minimum, and (c) and (f) with the product.
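For illustration, here is a minimal NumPy sketch of the word-level aggregation step described above (the function name and the example probabilities are my own, not from the post): it collapses the per-frame maximum probabilities of one word into a single word-level score using the mean, minimum, or product, as in Figure 1.

```python
import numpy as np

def aggregate_word_confidence(frame_probs: np.ndarray, method: str = "prod") -> float:
    """Collapse the per-frame scores of one word into a word-level score."""
    if method == "mean":
        return float(np.mean(frame_probs))
    if method == "min":
        return float(np.min(frame_probs))
    if method == "prod":
        return float(np.prod(frame_probs))
    raise ValueError(f"unknown aggregation method: {method}")

# Per-frame maximum probabilities of one (possibly misrecognized) word:
# an overconfident model keeps them near 1 even when the word is wrong.
word_frames = np.array([0.97, 0.99, 0.95])
for m in ("mean", "min", "prod"):
    print(m, aggregate_word_confidence(word_frames, m))
```

Note how all three aggregations stay above 0.9 for this word, which is exactly the overconfidence problem the figure illustrates.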

An alternative to using raw probabilities as confidence is to train a separate confidence model that estimates confidence from the probabilities and, optionally, the ASR model’s embeddings. This approach can deliver fairly accurate confidence, at the cost of training an estimator for each model, incorporating it into inference alongside the main model (inevitably slowing inference down), and losing interpretability.

However, if you want the best possible measure of correctness and are comfortable with a system in which one neural network evaluates another, try neural confidence estimators.

Entropy-based confidence estimation

This section features a non-trainable yet effective confidence estimation approach. You will learn how to make fast, simple, robust, and adjustable confidence estimation methods for CTC and RNN-T ASR models in the greedy search recognition mode.

A simple entropy-based confidence measure

Treating confidence simply as the raw prediction probability is a no-go, as explained above. A better definition treats confidence as any function that maps all available knowledge about a prediction onto the interval [0,1]. Taking prediction probabilities “as is” still falls under this definition, but the broader view lets you experiment with adding external knowledge to the estimate or using the existing information (the probabilities) in new ways.

Confidence must stay true to its primary purpose, which is to be a measure of correctness. At a minimum, it must behave as such: assign higher values to predictions that are more likely to be correct.

As a confidence measure, entropy is theoretically justified and works fairly well. In information theory, entropy is a measure of uncertainty, computed from the probabilities of all possible outcomes.

This is exactly what is needed for ASR in the greedy decoding mode, where there is only one prediction per probability vector. Simply assign the entropy value to the prediction, then invert it (to turn uncertainty into “certainty”) and normalize it (map it onto [0,1]), as shown below.

$$\mathcal{H} = -\sum_{i=1}^{V} p_i \ln p_i, \qquad C_{\mathcal{H}} = 1 - \frac{\mathcal{H}}{\ln V}$$

Here, $\mathcal{H}$ is the Gibbs entropy (which is more convenient than the Shannon entropy when dealing with natural logarithmic probabilities), $V$ is the number of possible predictions (the vocabulary size for ASR models), and $C_{\mathcal{H}}$ is the Gibbs entropy confidence. Note that if the model is completely unsure (all probabilities equal $1/V$), the confidence will be zero even though there is still a $1/V$ chance of guessing the prediction correctly.

This normalization is convenient because it does not depend on the number of possible predictions and gives you a simple rule of thumb: accept predictions with confidence close to one, discard those close to zero.
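As a minimal sketch of this measure (assuming NumPy; the function name and the epsilon guard against log(0) are my additions), the computation looks like this:

```python
import numpy as np

def gibbs_entropy_confidence(probs: np.ndarray) -> float:
    """Linearly normalized Gibbs entropy confidence: C = 1 - H / ln(V)."""
    eps = 1e-10                                      # guard against log(0)
    entropy = -np.sum(probs * np.log(probs + eps))   # Gibbs entropy H
    vocab_size = probs.shape[-1]                     # number of possible predictions V
    return float(1.0 - entropy / np.log(vocab_size))

# A completely unsure model (uniform distribution) gets zero confidence;
# a completely sure one (one-hot distribution) gets confidence one.
print(gibbs_entropy_confidence(np.full(128, 1 / 128)))  # ~0.0
print(gibbs_entropy_confidence(np.eye(128)[0]))         # ~1.0
```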

Advanced entropy-based confidence measures

The entropy-based confidence estimate above is quite usable in that form, but there is still room for improvement. While theoretically correct, in practice it will never show values close to zero (which would require all probabilities to be equal) because of the model’s overconfidence.

Moreover, in this form it cannot address different degrees of overconfidence. The second issue is tricky, but the first can be solved with a different normalization. Exponentiation helps here: when $e$ is raised to a negative power, the result lies in the interval (0, 1] and is close to zero for most arguments.

With this property, you can normalize entropy using the following formula:

$$C_{\mathcal{H}}^{\exp} = \frac{e^{-\mathcal{H}} - e^{-\ln V}}{1 - e^{-\ln V}} = \frac{V\,e^{-\mathcal{H}} - 1}{V - 1}$$

Here, $C_{\mathcal{H}}^{\exp}$ is the exponentially normalized Gibbs entropy confidence.
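Continuing the NumPy sketch from above (the naming is again my own), the exponential normalization changes only the final mapping:

```python
import numpy as np

def exp_gibbs_entropy_confidence(probs: np.ndarray) -> float:
    """Exponentially normalized Gibbs entropy confidence.

    Maps exp(-H) from its natural range [1/V, 1] linearly onto [0, 1],
    so the score stays near zero for most high-entropy distributions.
    """
    eps = 1e-10
    entropy = -np.sum(probs * np.log(probs + eps))
    vocab_size = probs.shape[-1]
    return float((vocab_size * np.exp(-entropy) - 1.0) / (vocab_size - 1.0))
```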

To make entropy-based confidence truly robust to overconfidence requires a method called temperature scaling. This method multiplies the log-softmax outputs by a number $0 < \tau \le 1$ (equivalently, applies the softmax with temperature $T = 1/\tau$), flattening the probability distribution before the entropy is computed and thus counteracting the model’s overconfidence.
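Here is a sketch of that scaling step (assuming NumPy; the example value of τ and the log-probabilities are illustrative only, not taken from the post):

```python
import numpy as np

def temperature_scaled_probs(log_probs: np.ndarray, tau: float) -> np.ndarray:
    """Multiply log-softmax outputs by 0 < tau <= 1 (temperature T = 1/tau)
    and renormalize; tau < 1 flattens an overconfident distribution."""
    scaled = tau * log_probs
    scaled -= scaled.max()           # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# An overconfident frame distribution becomes noticeably flatter after
# scaling, so any entropy-based confidence computed from it drops too.
log_probs = np.log(np.array([0.97, 0.01, 0.01, 0.01]))
print(temperature_scaled_probs(log_probs, tau=0.33))
```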
