Introducing NVIDIA Riva: A GPU-Accelerated SDK for Developing Speech AI Applications

Learn about the Riva SDK and its use in developing speech AI applications. We also discuss pretrained models in NGC, the TAO Toolkit for transfer learning, and Riva's optimized speech services for high-performance inference.

This post is an updated version of an article originally published in May 2020.

Speech AI is used in a variety of applications, including call centers for empowering human agents, speech interfaces for virtual assistants, and live captioning in video conferencing. Speech AI includes automatic speech recognition (ASR) and text-to-speech (TTS). The ASR pipeline takes raw audio and converts it to text, and the TTS pipeline takes text and converts it to audio.

Developing and running these real-time speech AI services is a complex and difficult task. Building speech AI applications requires hundreds of thousands of hours of audio data, tools to build and customize models for your specific use case, and scalable deployment support. It also means running in real time, with latency far below 300 milliseconds, to enable natural interactions with users. NVIDIA Riva streamlines the end-to-end process of developing speech AI services and provides real-time performance for human-like interactions.

Riva SDK

NVIDIA Riva is a GPU-accelerated SDK for developing speech AI applications. Riva is designed to help you access conversational AI functionalities easily and quickly. With a few commands, you can access the high-performance services through API operations and try demos.

Figure 1. Riva workflow for building speech applications

The Riva SDK includes pretrained speech and language models, the NVIDIA TAO Toolkit for fine-tuning these models on a custom dataset, and optimized end-to-end skills for speech recognition, language understanding, and speech synthesis.

Using Riva, you can easily fine-tune state-of-the-art models on your data to achieve a deeper understanding of their specific context. You can then optimize the models for inference to offer real-time services that run in 150 milliseconds (ms), compared to the 25 seconds required on CPU-only platforms.

Task-specific AI services and gRPC endpoints provide out-of-the-box, high-performance ASR, NLP, and TTS. All these AI services are trained with thousands of hours of public and internal datasets to reach high accuracy. You can start using the pretrained models or fine-tune them with your own dataset to further improve model performance.

Riva uses NVIDIA Triton Inference Server to serve multiple models for efficient and robust resource allocation, as well as to achieve high performance in terms of high throughput, low latency, and high accuracy.

Overview of Riva skills

Riva provides highly optimized services for speech recognition and speech synthesis for use cases like real-time transcription and virtual assistants. The speech recognition skill is trained and evaluated on a wide variety of real-world, domain-specific datasets. It includes vocabulary from telecommunications, podcasting, and healthcare to deliver world-class accuracy in production use cases.

The Riva text-to-speech, or speech synthesis, skill generates human-like speech. It uses non-autoregressive models to deliver 12x higher performance on NVIDIA A100 GPUs compared with Tacotron 2 and WaveGlow models on NVIDIA V100 GPUs. Furthermore, the service enables you to create a natural custom voice for your brand or virtual assistant in a day, using just 30 minutes of an actor's speech data.

Figure 2. Riva service capabilities

To take full advantage of the computational power of GPUs, Riva is built on NVIDIA Triton Inference Server to serve neural networks and ensemble pipelines, which run efficiently with NVIDIA TensorRT.

Riva services are exposed through API operations accessible by gRPC endpoints that hide all the complexity. Figure 3 shows the server side of the system. The gRPC API operations are exposed by an API server running in a Docker container, which is responsible for processing all incoming and outgoing speech and NLP data.

Figure 3. Riva service pipelines

The API server sends inference requests to NVIDIA Triton and receives the results.

NVIDIA Triton is the backend server that processes multiple inference requests on multiple GPUs for many neural networks or ensemble pipelines simultaneously.

For conversational AI applications, it is crucial to keep latency below a given threshold, which means executing inference requests as soon as they arrive. To saturate the GPUs and increase throughput, however, it is better to increase the batch size and delay inference execution until more requests are received and a bigger batch can be formed. NVIDIA Triton's dynamic batching balances these competing goals: it groups incoming requests into batches while capping how long any single request can wait in the queue.
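
To make this concrete, dynamic batching is enabled per model in the Triton model configuration file. The following is a minimal sketch of the relevant config.pbtxt section; the batch sizes and queue delay shown are illustrative, and the values Riva ships with its models may differ.

# Sketch of a Triton config.pbtxt fragment enabling dynamic batching
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]        # try to form batches of these sizes
  max_queue_delay_microseconds: 1000    # hold a request at most 1 ms for batching
}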

NVIDIA Triton is also responsible for the context switch of stateful networks, preserving their state from one request to the next.

Riva can be installed directly on bare metal through simple scripts that download the appropriate models and containers from NGC, or it can be deployed on Kubernetes through a Helm chart, which is also provided.
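
On bare metal, the flow looks roughly like the following sketch, based on the Riva Quick Start resource on NGC. The version string is illustrative and depends on the release you download.

# Download the Riva Quick Start scripts from NGC (version shown is illustrative)
ngc registry resource download-version "nvidia/riva/riva_quickstart:1.0.0-b.1"
cd riva_quickstart_v1.0.0-b.1

# Edit config.sh to choose services and models, then initialize and start Riva
bash riva_init.sh     # downloads the containers and models from NGC
bash riva_start.sh    # starts the Riva server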

Here’s a quick look at how you can interact with Riva. A Python interface makes communication with a Riva server easier on the client side through simple Python API operations. For example, here’s how a request for an existing TTS Riva service is created in three steps.

First, import the required modules, including the Riva API:

import grpc                                  # gRPC channel to the Riva server
import numpy as np                           # to decode the returned audio buffer
import src.riva_proto.riva_tts_pb2 as rtts
import src.riva_proto.riva_tts_pb2_grpc as rtts_srv
import src.riva_proto.riva_audio_pb2 as ri

Next, create a gRPC channel to the Riva endpoint:

# Open a gRPC channel to the local Riva server and create a TTS service stub
channel = grpc.insecure_channel('localhost:50051')
riva_tts = rtts_srv.RivaSpeechSynthesisStub(channel)

Then, create a TTS request:

req = rtts.SynthesizeSpeechRequest()
req.text = "We know what we are, but not what we may be?"
req.language_code = "en-US"                 # language of the input text
req.encoding = ri.AudioEncoding.LINEAR_PCM  # raw PCM samples (float32)
req.sample_rate_hz = 22050                  # sample rate of the generated audio
req.voice_name = "ljspeech"                 # voice to use for synthesis
resp = riva_tts.Synthesize(req)
audio_samples = np.frombuffer(resp.audio, dtype=np.float32)
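
The response contains raw float32 audio samples. To listen to the result, write them to a WAV file. Here is a minimal sketch using only NumPy and the Python standard library; the output file name is arbitrary:

import wave

# Convert float32 samples in [-1, 1] to 16-bit PCM and write them to a WAV file
pcm16 = (np.clip(audio_samples, -1.0, 1.0) * 32767).astype(np.int16)
with wave.open('tts_output.wav', 'wb') as f:
    f.setnchannels(1)          # mono output
    f.setsampwidth(2)          # 2 bytes per sample (int16)
    f.setframerate(22050)      # matches req.sample_rate_hz above
    f.writeframes(pcm16.tobytes())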

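A speech recognition request follows the same three steps in the opposite direction. The sketch below assumes ASR proto modules and a stub named analogously to the TTS ones above (riva_asr_pb2, RivaSpeechRecognitionStub) and reuses the channel and audio proto module from the previous example; see the Riva client documentation for the exact names and fields.

import src.riva_proto.riva_asr_pb2 as rasr
import src.riva_proto.riva_asr_pb2_grpc as rasr_srv

# Reuse the gRPC channel from the TTS example to create an ASR service stub
riva_asr = rasr_srv.RivaSpeechRecognitionStub(channel)

# Read a mono 16-kHz WAV recording (hypothetical file name)
with open('speech_sample.wav', 'rb') as fh:
    audio_bytes = fh.read()

# Build and send an offline recognition request, then print the top transcript
config = rasr.RecognitionConfig(
    encoding=ri.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=1)
req = rasr.RecognizeRequest(config=config, audio=audio_bytes)
resp = riva_asr.Recognize(req)
print(resp.results[0].alternatives[0].transcript)
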
Customizing a model with your data

Using the NVIDIA TAO Toolkit, you can bring a custom-trained model into Riva (Figure 4). The NVIDIA TAO Toolkit is a no-code tool for fine-tuning pretrained models on a domain-specific dataset.

Figure 4. NVIDIA TAO Toolkit pipeline

For instance, to improve the legibility and accuracy of transcripts from an ASR system, which generates text without punctuation or capitalization, you can add a custom punctuation and capitalization model on top of it.

Starting from a pretrained BERT model, the first step is to prepare the dataset. For every word in the training dataset, the goal is to predict the following (a data-preparation sketch follows this list):

  • The punctuation mark that should follow the word.
  • Whether the word should be capitalized.
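
To make the task concrete, here is a minimal sketch of deriving such word-level targets from ordinary punctuated text. The tuple format shown is illustrative; the TAO Toolkit documentation defines the exact dataset format the training script expects.

def make_targets(text):
    """Turn punctuated text into (word, punctuation, capitalized) targets."""
    targets = []
    for token in text.split():
        word = token.rstrip('.,?!')           # strip trailing punctuation marks
        punct = token[len(word):] or 'O'      # 'O' means no punctuation follows
        targets.append((word.lower(), punct, word[0].isupper()))
    return targets

print(make_targets("We know what we are, but not what we may be?"))
# [('we', 'O', True), ('know', 'O', False), ..., ('are', ',', False), ...
#  ('be', '?', False)]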

After the dataset is ready, the next step is to train the model by running a provided script. When training is complete and the desired accuracy is reached, create the model repository for NVIDIA Triton by using an included script.

The NVIDIA Riva Speech Skills documentation contains more details about how to train or fine-tune other models. Punctuation and capitalization is only one of the many customization possibilities with the TAO Toolkit.

Deploying a model in Riva

Riva is designed for conversational AI at scale. To help you serve models efficiently and robustly across different servers, NVIDIA provides push-button model deployment using Helm charts (Figure 5).

Figure 5. Models can be deployed in Riva by modifying the available Helm chart

The Helm chart configuration, available from the NGC catalog, can be modified for custom use cases. You can change settings related to which models to deploy, where to store them, and how to expose the services.
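
For example, fetching and installing the chart looks roughly like the following sketch. The chart URL, version, and release name are illustrative; see the Riva Helm chart page on NGC for the exact values.

# Fetch the Riva Helm chart from NGC (URL and version are illustrative)
helm fetch https://helm.ngc.nvidia.com/nvidia/riva/charts/riva-api-1.0.0-b.1.tgz \
    --username='$oauthtoken' --password=$NGC_API_KEY --untar

# Install the chart after editing values.yaml to select models and services
helm install riva-api riva-api/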

Conclusion

Riva is available as an open beta to members of the NVIDIA Developer Program. Whether you are building real-time transcription, virtual assistants, or a custom voice, Riva is here to enable your development. If you are deploying at scale, Riva Enterprise supports large-scale deployments and includes support from NVIDIA AI experts.

For more information, see Riva Getting Started.
