How to Create a Custom Language Model



Generative AI has captured the attention and imagination of the public over the past couple of years. From a given natural language prompt, these generative models are able to generate human-quality results, from well-articulated children’s stories to product prototype visualizations. 

Large language models (LLMs) are at the center of this revolution. LLMs are universal language comprehenders that codify human knowledge and can be readily applied to numerous natural and programming language understanding tasks, out of the box. These include summarization, translation, question answering, and code annotation and completion. 

The ability of a single foundation language model to complete many tasks opens up a whole new AI software paradigm, where a single foundation model can be used to cater to multiple downstream language tasks within all departments of a company. This simplifies and reduces the cost of AI software development, deployment, and maintenance.

Introduction to creating a custom large language model 

While potent and promising, out-of-the-box LLM performance through zero-shot or few-shot learning still falls short for specific use cases. In particular, zero-shot learning performance tends to be low and unreliable. Few-shot learning, on the other hand, relies on finding optimal discrete prompts, which is a nontrivial process.

As explained in GPT Understands, Too, minor variations in the prompt template used to solve a downstream problem can have significant impacts on the final accuracy. In addition, few-shot inference also costs more due to the larger prompts.

Parameter-efficient fine-tuning techniques have been proposed to address this problem. Prompt learning is one such technique, which appends virtual prompt tokens to a request. These virtual tokens are learnable parameters that can be optimized using standard optimization methods, while the LLM parameters are frozen. 
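
The following minimal PyTorch sketch (a hypothetical wrapper, not the NeMo implementation) illustrates the core idea: the LLM weights are frozen, and only a small matrix of virtual token embeddings is trained and prepended to the request's input embeddings.

import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Hypothetical wrapper: freeze the LLM, train only the virtual token embeddings."""
    def __init__(self, llm: nn.Module, hidden_size: int, num_virtual_tokens: int = 10):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():          # freeze every LLM parameter
            p.requires_grad = False
        # the only trainable parameters: one embedding per virtual prompt token
        self.soft_prompt = nn.Parameter(0.02 * torch.randn(num_virtual_tokens, hidden_size))

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq_len, hidden] embeddings of the discrete request tokens
        prompt = self.soft_prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return self.llm(torch.cat([prompt, input_embeds], dim=1))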

This post walks through the process of customizing LLMs with NVIDIA NeMo, a universal framework for training, customizing, and deploying foundation models. 

What is NVIDIA NeMo?

NVIDIA NeMo is the universal framework for training, customizing, and deploying large-scale foundation models. NeMo takes advantage of various parallelism techniques to accelerate training and inference, and can be deployed on multi-node, multi-GPU systems on user-preferred cloud, on-premises, and edge systems. To learn more, see NVIDIA AI Platform Delivers Big Gains for Large Language Models and Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server.

The NeMo ecosystem consists of the following main components:

  • NVIDIA NeMo service: Provides a fast path to productionizing LLMs through NVIDIA-managed services. Developers can quickly and easily develop enterprise AI applications leveraging LLM capabilities without worrying about the underlying infrastructure. You can also experience Megatron 530B—one of the largest language models—through the cloud API or a web playground interface. Currently in early access.
  • NVIDIA NeMo framework: An end-to-end containerized framework that allows developers to efficiently train and deploy language models with billions and trillions of parameters, delivering high training efficiency across thousands of GPUs. The NeMo framework container is currently in open beta and available through NGC.
  • NVIDIA/NeMo: An open-source conversational AI toolkit built for researchers working on speech AI and NLP, including LLMs. Available through GitHub.
  • NeMo models: NVIDIA recently open-sourced pretrained NeMo framework models, ranging from smaller models, such as 1.3B GPT-3, 5B GPT-3, and 3B mT5, to larger models, such as the 20B GPT-3.
  • NVIDIA/FasterTransformer: An open-source toolkit for high-performance inference of LLMs available through GitHub. To learn more about how to deploy the public NeMo framework models with FasterTransformer, see Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Megatron.
This post explains how to use the NeMo framework container to customize a public NeMo model with prompt-learning techniques.

    Prompt learning with NeMo

    Prompt learning collectively refers to two parameter-efficient fine-tuning techniques, as detailed below. For more information, see Adapting P-Tuning to Solve Non-English Downstream Tasks.

• In prompt-tuning, soft prompt embeddings are initialized as a 2D matrix. Each task has its own 2D embedding matrix associated with it. Tasks do not share any parameters during training or inference. All LLM parameters are frozen and only the embedding parameters for each task are updated during training. The NeMo prompt-tuning implementation is based on The Power of Scale for Parameter-Efficient Prompt Tuning.
• In p-tuning, an LSTM model, or “prompt encoder,” is used to predict virtual token embeddings. LSTM parameters are randomly initialized at the start of p-tuning. All LLM parameters are frozen, and only the LSTM weights are updated at each training step. LSTM parameters are shared between all tasks that are p-tuned at the same time, but the LSTM model outputs unique virtual token embeddings for each task (a simplified sketch follows this list). The NeMo p-tuning implementation is based on GPT Understands, Too.
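
As a rough illustration of the p-tuning approach, the sketch below shows a simplified, hypothetical LSTM prompt encoder in PyTorch; NeMo's actual implementation differs in detail.

import torch
import torch.nn as nn

class LSTMPromptEncoder(nn.Module):
    """Simplified p-tuning-style prompt encoder: predicts virtual token embeddings."""
    def __init__(self, num_virtual_tokens: int, hidden_size: int):
        super().__init__()
        # a learned "seed" embedding per virtual token position
        self.seed = nn.Embedding(num_virtual_tokens, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))
        self.register_buffer("positions", torch.arange(num_virtual_tokens))

    def forward(self, batch_size: int) -> torch.Tensor:
        x = self.seed(self.positions).unsqueeze(0)   # [1, tokens, hidden]
        out, _ = self.lstm(x)                        # [1, tokens, 2 * hidden]
        virtual_embeds = self.mlp(out)               # [1, tokens, hidden]
        return virtual_embeds.expand(batch_size, -1, -1)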

    Prompt learning for this example uses two open-source components of the NeMo ecosystem: the NeMo OSS toolkit and public NeMo models.

The process of prompt learning on a small GPT-3 345M parameter model is detailed in the NeMo Multitask Prompt and P-Tuning tutorial on GitHub. This tutorial demonstrates the end-to-end prompt-learning workflow: downloading and preprocessing data, downloading the model, training a prompt-learning model, and running inference for three different applications.

    The sections below first walk through the notebook while summarizing the main concepts. Then this notebook will be extended to carry out prompt learning on larger NeMo models.

Prerequisites

    You can experience NeMo through the NeMo Docker container. This provides a self-sufficient and reproducible environment for experimenting with NeMo. The NeMo Multitask Prompt and P-Tuning tutorial was tested with the NeMo 22.09 container, but you can try later releases of the same container. Download and run this container using the following script:

docker run --gpus all -u $(id -u ${USER}):$(id -g ${USER}) --rm -it --net=host nvcr.io/nvidia/nemo:22.09 bash

    Then from within the container interactive bash environment, start Jupyter lab:

    cd /workspace
    jupyter lab --ip 0.0.0.0 --allow-root --port=8888
    

From Jupyter lab, you will find NeMo examples, including the above-mentioned notebook, under /workspace/nemo/tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb.

In addition, you will need one GPU to work with the smaller 1.3B and 5B GPT-3 models, and four NVIDIA Ampere architecture or NVIDIA Hopper architecture GPUs to work with the 20B model, as it uses a tensor parallelism (TP) degree of 4.

    Data preparation

The notebook walks you through data collection and preprocessing for three different applications: the Financial PhraseBank dataset for the sentiment analysis task, the SQuAD dataset for the question answering task, and the Assistant Benchmarking dataset for the intent and slot classification task.

The dataset should be in .jsonl format, containing a collection of JSON objects. Each JSON object must include the field taskname, which is a string identifier for the task the data example corresponds to. Each object should also include one or more fields corresponding to different sections of the discrete text prompt. See Figure 1 for an example.

Figure 1. Dataset format for NeMo prompt learning
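
As a minimal illustration of this format, the snippet below writes a few hypothetical records to a .jsonl file; the taskname field is required, while the other field names and label format are examples only and must match your prompt templates.

import json

examples = [
    {"taskname": "sentiment", "sentence": "Quarterly revenue beat expectations.", "label": "positive"},
    {"taskname": "sentiment", "sentence": "The plant will be shut down next month.", "label": "negative"},
    {"taskname": "intent_and_slot", "utterance": "wake me up at seven tomorrow",
     "label": "intent: alarm_set slots: time(seven tomorrow)"},
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")     # one JSON object per line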

    Prompt template

    You should determine and adhere to a pattern when forming the prompt. This pattern is called the prompt template and varies according to the use case. An example for sentiment analysis is shown below.

{
    "taskname": "sentiment",
    "prompt_template": "<|VIRTUAL_PROMPT_0|> {sentence} sentiment:{label}",
    "total_virtual_tokens": 10,
    "virtual_token_splits": [10],
    "truncate_field": None,
    "answer_only_loss": True,
    "answer_field": "label",
}
    

The prompt places all 10 virtual tokens at the beginning, followed by the target sentence to classify, then a text marker (“sentiment:”), and finally the label of the sentence for training. The corresponding fields in the training data JSON object are mapped to this prompt template to form complete training examples. NeMo supports truncating specific fields (truncate_field) to meet the model token-length limit (typically 2,048 tokens for the NeMo public models, which use the HuggingFace GPT-2 tokenizer).
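
To make the mapping concrete, the following illustrative snippet renders one training example from the sentiment template above. NeMo performs this assembly internally, replacing <|VIRTUAL_PROMPT_0|> with the 10 trainable virtual token embeddings.

prompt_template = "<|VIRTUAL_PROMPT_0|> {sentence} sentiment:{label}"
record = {"taskname": "sentiment",
          "sentence": "Quarterly revenue beat expectations.",
          "label": " positive"}

# Fill the template fields from the data record (illustrative only)
rendered = prompt_template.format(sentence=record["sentence"], label=record["label"])
print(rendered)
# <|VIRTUAL_PROMPT_0|> Quarterly revenue beat expectations. sentiment: positive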

    Training

The default NeMo prompt-tuning configuration is provided in a YAML file, available through NVIDIA/NeMo on GitHub. The notebook loads this YAML file, then overrides the training options to suit the 345M GPT model. NeMo p-tuning enables multiple tasks to be learned concurrently. NeMo leverages the PyTorch Lightning interface, so training can be done as simply as invoking a trainer.fit(model) statement.
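
Condensed into a few lines, the training flow in the notebook looks roughly like the sketch below. The config filename and module path are taken from the NeMo 22.09-era toolkit and may differ in later releases; the data and model paths are placeholders, and the notebook additionally wires in NeMo's distributed strategy and plugins.

import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model import (
    MegatronGPTPromptLearningModel,
)

# Load the default prompt-learning config (filename assumed) and override it
cfg = OmegaConf.load("conf/megatron_gpt_prompt_learning_config.yaml")
cfg.model.language_model_path = "megatron_gpt_345m.nemo"     # frozen base LLM
cfg.model.new_tasks = ["sentiment", "intent_and_slot", "squad"]
cfg.model.data.train_ds = ["data/train.jsonl"]               # placeholder paths
cfg.model.data.validation_ds = ["data/val.jsonl"]

trainer = pl.Trainer(**cfg.trainer)      # the notebook also configures NeMo's DDP strategy
model = MegatronGPTPromptLearningModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)                       # all new tasks are p-tuned concurrently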

    Inference

Finally, once trained, the model can be used for inference on new samples (omitting the answer_field) by invoking the model.generate(inputs=test_examples) statement.
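
Continuing the sketch above, inference only requires the non-answer fields of each example; the call shown is the one described in this post, though exact keyword arguments may vary across NeMo versions.

# The answer field ("label") is omitted; the model generates it
test_examples = [
    {"taskname": "sentiment", "sentence": "Margins improved for the third straight quarter."},
    {"taskname": "sentiment", "sentence": "The company issued a profit warning."},
]
response = model.generate(inputs=test_examples)
print(response)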

    Prompt learning on larger models

The 345M GPT-3 model process demonstrated in the notebook can be applied to larger public NeMo GPT-3 models, up to the 1.3B GPT-3 and 5B GPT-3 models. Models of this size require only a single GPU of sufficient memory capacity, such as an NVIDIA V100, A100, or H100. After downloading the model, substitute the model name in the following cell:

# Download the model from NGC
gpt_file_name = "megatron_gpt_345m.nemo"
!wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/megatron_gpt_345m/versions/1/files/megatron_gpt_345m.nemo

    Instead of downloading the 345M GPT model from NGC, download either the 1.3B GPT-3 or 5B GPT-3 models following the instructions on HuggingFace, then point the gpt_file_name variable to the .nemo model file. 

Note that the 5B model comes in several variants: one with a TP degree of 1 (nemo_gpt5B_fp16_tp1.nemo) and others with TP=2 (nemo_gpt5B_fp16_tp2.nemo, nemo_gpt5B_bf16_tp2.nemo). The notebook only supports the TP=1 variant. With everything else left unchanged, you can execute the same notebook end to end.
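
For example, if you downloaded the TP=1 variant of the 5B model, the cell reduces to pointing gpt_file_name at that file (filename as listed above; adjust it to whatever the download actually produced):

# Point the notebook at the downloaded TP=1 checkpoint instead of the 345M model
gpt_file_name = "nemo_gpt5B_fp16_tp1.nemo"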

    Multi-GPU prompt learning

Due to the limitations of the Jupyter notebook environment, the prompt-learning notebook only supports single-GPU training. Leveraging multi-GPU training for larger models, with a higher TP degree (such as 4 for the 20B GPT-3 model, or 2 for the TP=2 variants of the 5B GPT-3 model), requires the use of a different NeMo prompt-learning script. This script is paired with a config file where you can find the default values for many parameters.

    Models

    This section demonstrates the process of prompt learning of a large model using multiple GPUs on the assistant dataset that was downloaded and preprocessed as part of the prompt learning notebook.

You can download either the 5B GPT model with TP=2 (nemo_gpt5B_fp16_tp2.nemo) or the 20B GPT-3 model with TP=4. Note that these models are stored as .nemo archives. To speed up model loading substantially, extract the archive beforehand and use the extracted folder in the NeMo configuration. Use the following script:

mkdir nemo_gpt5B_fp16_tp2.nemo.extracted
tar -xvf nemo_gpt5B_fp16_tp2.nemo -C nemo_gpt5B_fp16_tp2.nemo.extracted

Then use the extracted directory nemo_gpt5B_fp16_tp2.nemo.extracted in the NeMo configuration.

    Configuration

    A configuration file suitable for the assistant dataset (intent and slot detection application) is shown below:

    name: megatron_virtual_prompt_gpt
    
    trainer:
      devices: 2
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: False # logger provided by exp_manager
      enable_checkpointing: False
      replace_sampler_ddp: False
      max_epochs: 25 # min 25 recommended
      max_steps: -1 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
      log_every_n_steps: 10 # frequency with which training steps are logged 
      val_check_interval: 1.0 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
      gradient_clip_val: 1.0
      resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
      benchmark: False
    
    
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: False
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: True
      resume_ignore_no_checkpoint: True
      create_checkpoint_callback: True
      checkpoint_callback_params:
        monitor: val_loss
        save_top_k: 2
        mode: min
        save_nemo_on_train_end: False # Should be false, correct prompt learning model file is saved at model.nemo_path set below, 
        filename: 'megatron_gpt_prompt_tune--{val_loss:.3f}-{step}'
        model_parallel_size: ${model.tensor_model_parallel_size}
        save_best_model: True
    
    model:
      seed: 1234
      nemo_path: ${name}.nemo # .nemo filename/absolute path to where the virtual prompt model parameters will be saved
      virtual_prompt_style: 'p-tuning' # one of 'prompt-tuning', 'p-tuning', or 'inference'
      tensor_model_parallel_size: 1 # intra-layer model parallelism
      pipeline_model_parallel_size: 1 # inter-layer model parallelism
      global_batch_size: 8
      micro_batch_size: 4
    
      restore_path: null # Path to an existing p-tuned/prompt tuned .nemo model you wish to add new tasks to or run inference with
      language_model_path: ??? # Path to the GPT language model .nemo file, always required
      save_nemo_on_validation_end: True # Saves an inference ready .nemo file every time a checkpoint is saved during training. 
      existing_tasks: [] # List of tasks the model has already been p-tuned/prompt-tuned for, needed when a restore path is given
      new_tasks: ['intent_and_slot'] # List of new tasknames to be prompt-tuned
      
    
    
      ## Sequence Parallelism
      # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially
      # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
      sequence_parallel: False
    
      ## Activation Checkpoint 
      activations_checkpoint_granularity: null # 'selective' or 'full' 
      activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
      # 'uniform' divides the total number of transformer layers and checkpoints the input activation
      # of each chunk at the specified granularity
      # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
      activations_checkpoint_num_layers: null # not used with 'selective'
    
      task_templates: # Add more/replace tasks as needed, these are just examples
      - taskname: "intent_and_slot"    
        prompt_template: "<|VIRTUAL_PROMPT_0|>Predict intent and slot: {utterance} \nLabel:{label}"
        total_virtual_tokens: 10
        virtual_token_splits: [10]
        truncate_field: null
        answer_only_loss: False
        "answer_field": "label"
    
    
      prompt_tuning: # Prompt tuning specific params
        new_prompt_init_methods: ['text'] # List of 'text' or 'random', should correspond to tasks listed in new_tasks
        new_prompt_init_text: ['some init text goes here'] # some init text if init method is text, or None if init method is random
    
      p_tuning: # P-tuning specific params
        encoder_type: "tpmlp" # ['tpmlp', 'lstm', 'biglstm', 'mlp'] 
        dropout: 0.0
        num_layers: 2  # number of layers for MLP or LSTM layers. Note, it has no effect for tpmlp currently as it always assumes it is two layers.
        encoder_hidden: 2048 # encoder hidden for biglstm and tpmlp
        init_std: 0.023  # init std for tpmlp layers
    
      data:
        train_ds: ???
        validation_ds: ???
        add_eos: True
        shuffle: True
        num_workers: 8
        pin_memory: True
        train_cache_data_path: null  # the path to the train cache data 
        validation_cache_data_path: null  # the path to the validation cache data 
        test_cache_data_path: null  # the path to the test cache data 
        load_cache: False  # whether to load from the cache data
    
    
      optim:
        name: fused_adam
        lr: 1e-4
        weight_decay: 0.01 
        betas: 
        - 0.9
        - 0.98
        sched:
          name: CosineAnnealing
          warmup_steps: 50
          min_lr: 0.0 # min_lr must be 0.0 for prompt learning when pipeline parallel > 1
          constant_steps: 0 # Constant steps should also be 0 when min_lr=0
          monitor: val_loss
          reduce_on_plateau: false
    

Thanks to the YAML text format and comments, most of the hyperparameters are self-explanatory. Using the Jupyter lab interface, create a file with this content and save it under /workspace/nemo/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_intent_n_slot.yaml.
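
As an optional sanity check (an illustrative helper, not part of NeMo), you can load the saved file with OmegaConf, which is the config library NeMo uses, and print the values that matter most before launching a run:

from omegaconf import OmegaConf

cfg = OmegaConf.load(
    "/workspace/nemo/examples/nlp/language_modeling/conf/"
    "megatron_gpt_prompt_learning_intent_n_slot.yaml"
)
print("Prompt template:", cfg.model.task_templates[0].prompt_template)
print("New tasks:", cfg.model.new_tasks)
print("TP x PP:", cfg.model.tensor_model_parallel_size, "x", cfg.model.pipeline_model_parallel_size)
print("Global / micro batch:", cfg.model.global_batch_size, "/", cfg.model.micro_batch_size)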

    Most important in the config file is the prompt template shown below:

    prompt_template: "<|VIRTUAL_PROMPT_0|>Predict intent and slot: {utterance} \nLabel:{label}"
    total_virtual_tokens: 10
    virtual_token_splits: [10]
    truncate_field: null
    

    Here, 10 virtual prompt tokens are used together with some permanent text markers. 

    Training

To begin training, open a terminal window from within the Jupyter lab interface (File → New → Terminal). Then issue the following bash command:

python /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_prompt_learning.py \
    --config-name=megatron_gpt_prompt_learning_intent_n_slot.yaml \
    trainer.devices=2 \
    trainer.num_nodes=1 \
    trainer.max_epochs=25 \
    trainer.precision=bf16 \
    model.language_model_path=/workspace/nemo/tutorials/nlp/nemo-megatron-gpt-5B/nemo_gpt5B_fp16_tp2.nemo.extracted \
    model.nemo_path=/workspace/nemo/examples/nlp/language_modeling/intent_n_slot.nemo \
    model.tensor_model_parallel_size=2 \
    model.pipeline_model_parallel_size=1 \
    model.global_batch_size=16 \
    model.micro_batch_size=1 \
    model.optim.lr=1e-4 \
    model.data.train_ds=[/workspace/nemo/tutorials/nlp/data/assistant/assistant_train.jsonl] \
    model.data.validation_ds=[/workspace/nemo/tutorials/nlp/data/assistant/assistant_val.jsonl]

    Note the following:

    • model.tensor_model_parallel_size should be set to 2 for the 5B GPT model (nemo_gpt5B_fp16_tp2.nemo) or 4 for the 20B GPT-3 model
• trainer.devices should be set to a multiple of the TP value. If set to 4 for the 5B model (TP=2), there will be two data-parallel workers, each with two GPUs (see the arithmetic sketch after these notes)
    • model.language_model_path should be set to the absolute path of the model extracted directory
    • model.data.train_ds, model.data.validation_ds should be set to the location of the train and validation data
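
The notes above boil down to simple arithmetic. The following illustrative helper (hypothetical, not part of NeMo) checks that devices, TP, PP, and batch sizes are mutually consistent before launching a run:

def check_parallel_config(devices, num_nodes, tp, pp, global_batch, micro_batch):
    world_size = devices * num_nodes
    assert world_size % (tp * pp) == 0, "devices x nodes must be a multiple of TP x PP"
    data_parallel_size = world_size // (tp * pp)
    assert global_batch % (micro_batch * data_parallel_size) == 0, \
        "global batch must be divisible by micro batch x data-parallel size"
    gradient_accumulation = global_batch // (micro_batch * data_parallel_size)
    return data_parallel_size, gradient_accumulation

# The training command above: 2 GPUs with TP=2 gives one data-parallel worker
# that accumulates 16 micro-batches of size 1 per global batch of 16
print(check_parallel_config(devices=2, num_nodes=1, tp=2, pp=1,
                            global_batch=16, micro_batch=1))   # (1, 16)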

    Inference

    Finally, once trained, carry out inference in NeMo using the following script:

python /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_prompt_learning_eval.py \
    virtual_prompt_model_file=/workspace/nemo/examples/nlp/language_modeling/intent_n_slot.nemo \
    gpt_model_file=/workspace/nemo/tutorials/nlp/nemo-megatron-gpt-5B/nemo_gpt5B_fp16_tp2.nemo.extracted \
    inference.greedy=True \
    inference.add_BOS=False \
    inference.tokens_to_generate=128 \
    trainer.devices=2 \
    trainer.num_nodes=1 \
    tensor_model_parallel_size=2 \
    pipeline_model_parallel_size=1 \
    data_paths=["/workspace/nemo/tutorials/nlp/data/assistant/assistant_test.jsonl"] \
    pred_file_path="test-results.txt"
    

    Note the following:

• tensor_model_parallel_size should be set to 2 for the 5B GPT model (nemo_gpt5B_fp16_tp2.nemo) or 4 for the 20B GPT-3 model
    • trainer.devices should be set to equal the TP value (above)
• pred_file_path is the file where test results will be recorded, one line per test sample (a sketch for pairing these predictions with the test examples follows)
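
Once the run finishes, the predictions file can be paired back with the test set. Here is a minimal illustrative snippet, assuming the assistant test records carry an utterance field, as the prompt template suggests:

import json

# Load the test examples and the one-line-per-sample predictions
with open("/workspace/nemo/tutorials/nlp/data/assistant/assistant_test.jsonl") as f:
    tests = [json.loads(line) for line in f]
with open("test-results.txt") as f:
    preds = [line.rstrip("\n") for line in f]

# Print the first few utterance/prediction pairs
for example, prediction in list(zip(tests, preds))[:3]:
    print(example.get("utterance"), "->", prediction)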

    Get started customizing your language model

    This post walked through the language model customization process using NVIDIA NeMo. From a single public checkpoint, these models can be adapted to numerous NLP applications through a parameter-efficient, compute-efficient process. 

    Visit NVIDIA/NeMo on GitHub to get started with LLM customization. You can also sign up for the NVIDIA NeMo service early access program.

    Register for GTC 2023 for free and join us March 20–23 for How to Build Generative AI for Enterprise Use Cases and targeted session tracks for generative AI models for science and art, autonomous vehicles, content creation, and much more. 
