Speech Recognition: Customizing Models to Your Domain Using Transfer Learning

Creating a new AI/DL model is a resource-intensive process. The NVIDIA TAO Toolkit can cut that time from 80 weeks to 8, using transfer learning.

This post is part of a series about generating accurate speech transcription. For part 1, see Speech Recognition: Generating Accurate Transcriptions Using NVIDIA Riva. For part 3, see Speech Recognition: Deploying Models to Production.

Creating a new AI deep learning model from scratch is an extremely time- and resource-intensive process. A common solution to this problem is transfer learning. To make this process even easier, NVIDIA offers the TAO Toolkit, which can cut an engineering time frame of 80 weeks down to 8 weeks. The TAO Toolkit supports both computer vision and conversational AI (ASR and NLP) use cases.

In this post, we cover the following topics:

  • Installing the TAO Toolkit and getting access to pretrained models
  • Fine-tuning a pretrained speech transcription model
  • Exporting the fine-tuned model to NVIDIA Riva

To follow along, download the Jupyter notebook.

Installing the TAO Toolkit and downloading pretrained models

Before installing the TAO Toolkit, make sure you have the following installed on your system:

  • python >= 3.6.9
  • docker-ce > 19.03.5
  • nvidia-docker2 3.4.0-1
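
As a quick sanity check before installing, the following minimal sketch compares a Python version against the minimum listed above and looks for a docker executable on the PATH. The helper names are hypothetical, and the sketch does not verify the docker-ce or nvidia-docker2 versions.

```python
import shutil
import sys

MIN_PYTHON = (3, 6, 9)  # minimum Python version from the list above


def python_ok(version=None):
    """Return True if a (major, minor, micro) tuple meets MIN_PYTHON."""
    if version is None:
        version = tuple(sys.version_info[:3])
    return tuple(version) >= MIN_PYTHON


def docker_on_path():
    """Return True if a `docker` executable can be found on the PATH."""
    return shutil.which("docker") is not None


if __name__ == "__main__":
    print("Python >= 3.6.9:", python_ok())
    print("docker found on PATH:", docker_on_path())
```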

For more information about installing nvidia-docker and docker, see Prerequisites. You can install the TAO Toolkit with pip. We recommend using a virtual environment to avoid version conflicts.

pip3 install nvidia-pyindex
pip3 install nvidia-tao

With installation out of the way, the next step is to get some pretrained models. NVIDIA makes many AI and machine learning models available on NGC (NVIDIA GPU Cloud), not just for conversational AI but across a wide range of domains. The NGC Catalog is a curated set of GPU-optimized software for AI, HPC, and visualization.

To download resources from NGC, log in to the registry with your NGC API key. You can create and use one for free.

Figure 1. Getting the NGC API Key

CitriNet is a state-of-the-art automatic speech recognition (ASR) model built by NVIDIA, which enables you to generate speech transcriptions. You can download this model from the Speech to Text English Citrinet model card.

wget https://api.ngc.nvidia.com/v2/models/nvidia/tao/speechtotext_english_citrinet/versions/trainable_v1.7/files/speechtotext_english_citrinet_1024.tlt

To offer a streamlined experience, the toolkit downloads and runs Docker containers in the background, driven by specification files that are covered in the next section. The TAO launcher hides these details. You specify which local directories to mount into the container by defining a JSON file: ~/.tao_mounts.json. In the following example, /data holds the dataset, /specs holds the specification files, and /results holds the results. You can find the mount file in the Jupyter notebook.

{
    "Mounts": [
        {
            "source": "~/tao/data",
            "destination": "/data"
        },
        {
            "source": "~/tao/specs",
            "destination": "/specs"
        },
        {
            "source": "~/tao/results",
            "destination": "/results"
        },
        {
            "source": "~/.cache",
            "destination": "/root/.cache"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}
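
If you prefer to generate the mount file rather than edit it by hand, the following minimal sketch builds the same structure with Python's json module. The function names build_mounts and write_mounts are hypothetical. Expanding ~ to an absolute path up front is an assumption made here so that the file does not depend on how the launcher resolves it.

```python
import json
from pathlib import Path


def build_mounts(base_dir):
    """Build the mounts configuration shown above for a given base directory."""
    base = Path(base_dir).expanduser()
    pairs = [
        (base / "data", "/data"),        # dataset
        (base / "specs", "/specs"),      # specification files
        (base / "results", "/results"),  # results
        (Path("~/.cache").expanduser(), "/root/.cache"),
    ]
    return {
        "Mounts": [
            {"source": str(src), "destination": dst} for src, dst in pairs
        ],
        "DockerOptions": {
            "shm_size": "16G",
            "ulimits": {"memlock": -1, "stack": 67108864},
        },
    }


def write_mounts(config, path):
    """Write the configuration as JSON and return the resolved path."""
    path = Path(path).expanduser()
    path.write_text(json.dumps(config, indent=4))
    return path
```

Calling write_mounts(build_mounts("~/tao"), "~/.tao_mounts.json") would overwrite any existing mount file, so back it up first if you already have one.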

With this, you have installed the TAO Toolkit, downloaded a pretrained ASR model, and specified the mount points for the TAO Toolkit launcher. In the next section, we discuss how to use the TAO Toolkit to fine-tune this model on a dataset of your choice.

Fine-tuning the model

Fine-tuning a model with TAO Toolkit is a three-step process:

  • Download the spec files.
  • Preprocess the dataset.
  • Fine-tune with hyperparameters.

Figure 2 shows the steps needed to fine-tune the model.

    Figure 2. TAO Toolkit workflow

    Step 1: Download spec files

    NVIDIA TAO Toolkit is a low- or no-code solution that simplifies training and fine-tuning models through specification files. These files enable you to customize model-specific parameters, trainer parameters, the optimizer, and parameters for the dataset being used. You can download these specification files to the folder mounted earlier:

    tao speech_to_text_citrinet download_specs \
        -r /speech_to_text_citrinet \
        -o /speech_to_text_citrinet

    Here are the YAML files that come with the TAO Toolkit. For more information, see Downloading Sample Spec Files.

    • create_tokenizer.yaml
    • dataset_convert_an4.yaml
    • dataset_convert_en.yaml
    • dataset_convert_ru.yaml
    • evaluate.yaml
    • export.yaml
    • finetune.yaml
    • infer_onnx.yaml
    • infer.yaml
    • train_citrinet_256.yaml
    • train_citrinet_bpe.yaml

    These specification files are available for customization and use. They cover everything from preprocessing and model evaluation to inference and exporting the model, so you can develop or customize models without building elaborate code bases. With the spec files downloaded, you can now proceed to preprocessing the data.

    Step 2: Preprocess the dataset

    For this walkthrough, you use CMU’s AN4 dataset, a small census dataset that contains recordings of addresses, numbers, and other personal information. This is similar to the kind of transcription required in the opening steps of a customer support conversation. For a real-world application, you can use a larger custom dataset with similar content.

    You can directly download and unzip the AN4 dataset or use the following command:

    wget http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz
    tar -xvf an4_sphere.tar.gz

    The TAO Toolkit training and fine-tuning modules expect data to be present in a specific format. This preprocessing can be done using the dataset_convert command. We package specification files for AN4 and Mozilla’s Common Voice dataset along with the TAO launcher. You can find these specification files in the directory that you defined in Step 1.

    The conversion produces manifest files (Figure 3) that contain the following information, which is used in the later steps:

    • Path to audio files
    • Duration of each file
    • Content of each file in words

    Figure 3. Structure of the processed manifest files

    tao speech_to_text_citrinet dataset_convert \
        -e /speech_to_text_citrinet/dataset_convert_an4.yaml \
        -r /citrinet/dataset_convert \
        source_data_dir=/an4 \
        target_data_dir=/an4_converted

    This command converts the audio files to WAV files and generates train and test manifest files. For more information, see Preparing the Dataset.
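
To inspect a generated manifest, the following minimal sketch parses it and reports a few totals. It assumes one JSON object per line with the fields audio_filepath, duration, and text, which matches the three items listed above but is an assumption about the exact field names. The sample lines and paths are hypothetical.

```python
import json


def read_manifest(lines):
    """Parse manifest lines (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(entries):
    """Count utterances, total audio seconds, and total transcript words."""
    return {
        "utterances": len(entries),
        "total_seconds": sum(e["duration"] for e in entries),
        "words": sum(len(e["text"].split()) for e in entries),
    }


# Hypothetical sample lines; real manifests come from dataset_convert.
sample = [
    '{"audio_filepath": "/data/an4_converted/wavs/a.wav", "duration": 2.5, "text": "one two three"}',
    '{"audio_filepath": "/data/an4_converted/wavs/b.wav", "duration": 1.5, "text": "erase"}',
]
print(summarize(read_manifest(sample)))
```

For a real file, pass an open file handle instead of the sample list, because iterating over a file yields one line at a time.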

    For most models, preprocessing would end here, but CitriNet is a special case. It requires further processing in the form of subword tokenization, which creates a subword vocabulary for the text. This differs from Jasper or QuartzNet, where only single characters are regarded as elements of the vocabulary. In CitriNet, a subword can be one or multiple characters. You can create the tokenizer with the following command:

    tao speech_to_text_citrinet create_tokenizer \
        -e /speech_to_text_citrinet/create_tokenizer.yaml \
        -r /citrinet/create_tokenizer \
        manifests=/an4_converted/train_manifest.json \
        output_root=/an4 \
        vocab_size=32
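
To make the difference between a character vocabulary and a subword vocabulary concrete, here is a toy sketch. It uses a greedy longest-match split over a small, hypothetical vocabulary. This is only an illustration, not the algorithm that create_tokenizer runs, and the vocabulary is made up.

```python
def char_tokenize(text):
    """Character-level tokenization: every character is its own token."""
    return list(text)


def greedy_subword_tokenize(text, vocab):
    """Split text by repeatedly taking the longest piece found in vocab.

    Single characters are always accepted as a fallback, so any input
    can be tokenized.
    """
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical subword vocabulary, for illustration only.
toy_vocab = {"th", "ir", "ty", "se", "ven"}
print(char_tokenize("thirty seven"))
print(greedy_subword_tokenize("thirty seven", toy_vocab))
```

The subword version produces fewer tokens for the same text, which is the practical effect of allowing multi-character vocabulary entries.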

    Up to this point, you’ve set up a tool that provides a low-code or no-code solution for a complex problem like transfer learning. You’ve downloaded a pretrained model, processed audio files into the necessary format, and performed tokenization. You did all this with fewer than 10 commands. Now that all the necessary details have been hashed out, you can proceed to fine-tuning the model.

    Step 3: Fine-tuning with hyperparameters

    As in the previous steps, you interact with a specification file. For more information, see Creating an Experiment Spec File. You can specify almost everything, from training-specific parameters like the optimizer, to dataset-specific parameters, to the model configuration itself, for example if you want to adjust the window size for the FFT.

    Do you want to change the learning rate and the scheduler, and maybe add a new character in the vocabulary? There’s no need to open your code base and scan through it to make the changes. All these customizations are easily available and shareable across your team. This reduces friction around trying new ideas and sharing the results, as well as the configurations of the models that had better accuracy.

    Here’s how to configure the trainer:

    trainer:
      max_epochs: 3   # This is low for demo purposes
    tlt_checkpoint_interval: 1
    
    change_vocabulary: true

    Here’s how to configure the tokenizer:

    tokenizer:
      dir: /path/to/subword/vocabulary
      type: "bpe"   # Can be either bpe or wpe

    Here’s how to configure the optimizer:

    optim:
      name: novograd
      lr: 0.01
      betas: [0.8, 0.5]
      weight_decay: 0.001
    
      sched:
        name: CosineAnnealing
        warmup_steps: null
        warmup_ratio: null
        min_lr: 0.0
        last_epoch: -1
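
To get a feel for what a cosine annealing schedule does to the learning rate, here is a minimal sketch of the standard cosine annealing formula, using lr: 0.01 and min_lr: 0.0 from the spec above and no warmup. That the toolkit's CosineAnnealing scheduler follows exactly this formula is an assumption. The function name and the max_steps parameter are hypothetical.

```python
import math


def cosine_annealing_lr(step, max_steps, base_lr=0.01, min_lr=0.0):
    """Standard cosine annealing from base_lr down to min_lr over max_steps."""
    step = min(max(step, 0), max_steps)  # clamp to the schedule range
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / max_steps))
    return min_lr + (base_lr - min_lr) * cosine


# Print the learning rate at a few points of a 100-step schedule.
for step in (0, 25, 50, 75, 100):
    print(step, cosine_annealing_lr(step, max_steps=100))
```

The rate starts at the base value, falls slowly at first, drops fastest around the midpoint, and flattens out as it approaches min_lr.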

    Here’s how to configure the datasets:

    # Fine-tuning settings: validation dataset
    validation_ds:
      manifest_filepath: /path/to/manifest/file/
      sample_rate: 16000
      labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
      batch_size: 32
      shuffle: false
    
    finetuning_ds:
      manifest_filepath: ???
      sample_rate: 16000
      labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
      batch_size: 32
      trim_silence: true
      max_duration: 16.7
      shuffle: true
      is_tarred: false
      tarred_audio_filepaths: null

    Finally, modify the specification file as required and run the following command, which fine-tunes the downloaded model on the dataset downloaded earlier. The -k option takes the key used to encrypt and decrypt .tlt models; supply your own key there. For more information, see Fine-Tuning the Model.

    tao speech_to_text_citrinet finetune \
         -e $SPECS_DIR/speech_to_text_citrinet/finetune.yaml \
         -g 1 \
         -k \
         -m /speechtotext_english_citrinet_1024.tlt \
         -r $RESULTS_DIR/citrinet/finetune \
         finetuning_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
         validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
         trainer.max_epochs=1 \
         finetuning_ds.num_workers=1 \
         validation_ds.num_workers=1 \
         trainer.gpus=1 \
         tokenizer.dir=$DATA_DIR/an4/tokenizer_spe_unigram_v32
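
The trailing key=value arguments in this command, such as trainer.max_epochs=1, override values from the spec file. The following sketch shows the idea with plain dictionaries. It illustrates how a dotted key maps onto a nested configuration and is not the toolkit's own implementation. The function names are hypothetical.

```python
import copy


def parse_value(raw):
    """Convert a raw string to int or float when possible, else keep it."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_override(spec, override):
    """Return a copy of spec with one `dotted.key=value` override applied."""
    dotted_key, _, raw_value = override.partition("=")
    result = copy.deepcopy(spec)
    node = result
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        node = node.setdefault(name, {})  # create missing sections on the way
    node[leaf] = parse_value(raw_value)
    return result


spec = {"trainer": {"max_epochs": 3}}
print(apply_override(spec, "trainer.max_epochs=1"))
```

This is why the spec file can keep max_epochs: 3 while a single run uses a different value: the override applies to that invocation only and the file stays unchanged.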

    After fine-tuning or training your model, it is natural to evaluate the model and assess if further fine-tuning is required. To that end, NVIDIA provides capabilities to evaluate your model and run inference.

    Exporting the fine-tuned model to Riva

    Deploying a model in a production environment presents its own set of challenges. To that end, you can use NVIDIA Riva, a GPU-accelerated AI speech SDK for developing applications like real-time transcription and virtual assistants.

    Riva makes use of other NVIDIA products:

    • NVIDIA Triton Inference Server is used to simplify the deployment of models at scale in production.
    • NVIDIA TensorRT is used to accelerate the models and provide better inference performance by optimizing the models for NVIDIA GPUs.

    To use the model fine-tuned in this walkthrough, you can export it to Riva using the following command. As with fine-tuning, pass your model key with -k. For more information, see Model Export.

    tao speech_to_text_citrinet export \
         -e /speech_to_text_citrinet/export.yaml \
         -g 1 \
         -k \
         -m /citrinet/train/checkpoints/trained-model.tlt \
         -r /citrinet/riva \
         export_format=RIVA \
         export_to=asr-model.riva

    What’s next?

    CitriNet for speech transcription isn’t the only model or use case that NVIDIA provides. There are multiple use cases and pretrained models in conversational AI and computer vision. For more information, see the NVIDIA TAO Toolkit product page.

    In the next post, we cover how to install NVIDIA Riva to deploy these models in a production environment and how to use one of the many models in the NGC Catalog. For more information, see Speech Recognition: Deploying Models to Production.

    Source: NVIDIA