Speech Recognition: Generating Accurate Transcriptions Using NVIDIA Riva

Industries that commonly use AI include telco, financial services, healthcare, unified communication as a service, and retail.

Thousands of companies are using speech AI to interact with customers. Learn the benefits of using speech AI with the NVIDIA Riva end-to-end pipeline.

This post is part of a series about generating accurate speech transcription. For part 2, see Speech Recognition: Customizing Models to Your Domain Using Transfer Learning. For part 3, see Speech Recognition: Deploying Models to Production.

Every day, millions of audio minutes are produced across industries such as telecommunications, finance, and unified communications as a service (UCaaS). These audio minutes can be transcribed to empower call center agents with real-time recommendations, extract insights from customer call transcriptions, or generate live captioning in video conferencing meetings.

Figure 1. Speech AI in industries

Automatic speech recognition (ASR) enables you to transcribe speech into text. Generating quality transcriptions is challenging: it requires models that understand industry-specific jargon, hundreds to thousands of minutes of domain-specific audio for training, and pipelines that run in real time. NVIDIA Riva speech recognition is a skill that delivers world-class accuracy in real time for several common use cases across industries.

In this post, we discuss Riva speech recognition. Subsequent posts discuss how you can customize the speech recognition model and deploy it as an optimized skill:

  • Customizing Speech Recognition Models to Your Domain Using TAO Toolkit
  • Deploying Speech Recognition Models to Production Using Riva

Riva speech recognition

Riva is a GPU-accelerated speech AI SDK for building conversational AI applications such as real-time transcription and virtual assistants. Riva offers the following benefits:

  • Pretrained, state-of-the-art speech models in NGC
  • No-coding tools, such as the TAO Toolkit, for fine-tuning these models on a custom dataset
  • Optimized speech recognition and speech synthesis pipelines for high-performance inference

Video 1. Real-time Transcription with NVIDIA Riva Automatic Speech Recognition

The models underlying Riva are trained on NVIDIA supercomputers using hundreds to thousands of hours of open and real-world data, with vocabulary from industries like telco, finance, healthcare, and education. The training data also includes samples from noisy environments, spontaneous conversations, multiple English accents, and different sampling rates. All these attributes contribute to noise-robust, high-quality transcriptions.

The Riva speech recognition skill is evaluated on datasets from various real-world use cases, including video conferencing, contact centers, podcasts, and technical videos. You can deploy the skill in the cloud, in the data center, and at the edge.

The Riva speech recognition pipeline provides support for new state-of-the-art architectures while maintaining accuracy. Figure 2 shows the improvement in speech recognition accuracy over the last three years, achieved with new model architectures, training recipes, and the latest TensorRT and GPU-based optimizations.

Figure 2. Riva ASR accuracy improvement
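
The post does not name the metric behind Figure 2. A common way to quantify ASR accuracy is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The following is a minimal sketch of that calculation. The function name and the text normalization (lowercasing and splitting on whitespace) are assumptions for illustration and are not Riva's evaluation procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); normalization here is only
    lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the transcript "please transfer balance" against the reference "please transfer my balance" gives 0.25: one deleted word out of four. Lower is better, and a score above 1.0 is possible when the transcript contains many inserted words.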

With Riva, you can quickly deploy and scale to hundreds or thousands of concurrent streams with real-time latency, in either streaming or batch mode.
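
To make the difference between the two modes concrete: in batch mode a client typically sends a complete recording in one request, while in streaming mode it sends short chunks as the audio arrives and receives partial transcripts along the way. The sketch below shows only the client-side chunking step, assuming raw 16-bit mono PCM audio. The helper name and the 100 ms default are hypothetical choices for illustration, and the code does not call any Riva API.

```python
from typing import Iterator


def iter_audio_chunks(
    pcm: bytes,
    sample_rate_hz: int,
    chunk_ms: int = 100,
    bytes_per_sample: int = 2,
) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw mono PCM audio.

    Hypothetical helper: a streaming client would send each slice as soon
    as it is available, whereas a batch client would send `pcm` whole.
    """
    samples_per_chunk = (sample_rate_hz * chunk_ms) // 1000
    chunk_bytes = samples_per_chunk * bytes_per_sample
    if chunk_bytes <= 0:
        raise ValueError("chunk duration is too short for this sample rate")

    for start in range(0, len(pcm), chunk_bytes):
        # The final slice may be shorter than chunk_bytes.
        yield pcm[start:start + chunk_bytes]


# One second of 16 kHz, 16-bit silence as stand-in audio.
audio = bytes(16000 * 2)
chunks = list(iter_audio_chunks(audio, sample_rate_hz=16000))
```

Smaller chunks reduce the delay before the first partial transcript at the cost of more requests per second of audio, so the chunk duration is a latency and throughput trade-off to tune per application.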

For more information about customizing Riva models for your speech application, see the next post in this series, Speech Recognition: Customizing Models to Your Domain Using Transfer Learning. Part 3, Speech Recognition: Deploying Models to Production, covers how to deploy a fine-tuned model.

Source: NVIDIA