
Increasing Inference Acceleration of KoGPT with NVIDIA FasterTransformer


Transformers are one of the most influential AI model architectures today and are shaping the direction of future AI R&D. First invented as a tool for natural language processing (NLP), transformers are now used in almost every AI task, including computer vision, automatic speech recognition, molecular structure classification, and financial data processing.

In Korea, KakaoBrain has developed a highly accurate large language model (LLM), KoGPT, based on the transformer architecture. The model was trained on a large Korean dataset, and KakaoBrain successfully optimized it using NVIDIA FasterTransformer.

In this post, we cover how NVIDIA and KakaoBrain have optimized KoGPT with FasterTransformer.

Introduction to FasterTransformer

Figure 1. FasterTransformer used in NLP tasks such as classification, generation, text summarization, and sentiment analysis, with support for multi-GPU and multi-node execution

The transformer layer is currently the most widely used building block in deep learning. It originated in NLP and is now expanding beyond language into vision, speech, and generative AI.

One of the most popular AI model families in NLP is the GPT family. GPTs are LLMs developed by OpenAI that stack multiple layers of so-called decoder blocks from the transformer model architecture. For example, GPT-3 is an LLM with hundreds of billions of parameters that can store a huge amount of information, like a giant encyclopedia.
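
To make the architecture concrete, here is a minimal GPT-style decoder block in plain PyTorch. This is an illustrative sketch, not KakaoBrain's or OpenAI's code; a full GPT model stacks dozens of these blocks and adds token and position embeddings plus an output projection.

```python
import torch
import torch.nn as nn

class GPTDecoderBlock(nn.Module):
    """Minimal GPT-style decoder block: masked self-attention plus an MLP,
    each wrapped in a residual connection with pre-layer normalization."""
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

# GPT-3 scales this recipe to 96 layers and a hidden size of 12288.
block = GPTDecoderBlock()
tokens = torch.randn(1, 16, 768)  # (batch, sequence, hidden)
print(block(tokens).shape)        # torch.Size([1, 16, 768])
```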

However, training and serving these LLMs poses several challenges: a model with hundreds of billions of parameters no longer fits in the memory of a single GPU, so its weights must be partitioned across devices with model-parallel techniques such as tensor parallelism and pipeline parallelism. Even then, the sheer volume of computation makes training time and inference latency difficult to control.

The NVIDIA NeMo framework and FasterTransformer enable faster training and inference for LLMs with hundreds of billions of parameters.

Optimizations in FasterTransformer

FasterTransformer is a library that implements an inference acceleration engine for large transformer models using the model parallelization (tensor parallelism and pipeline parallelism) methods described earlier. FasterTransformer was developed to minimize latency and maximize throughput compared to previously available deep learning frameworks.
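
As a conceptual illustration of tensor parallelism (the idea, not FasterTransformer's fused CUDA implementation), the sketch below splits a linear layer's weight matrix column-wise across two hypothetical devices and shows that concatenating the partial outputs reproduces the single-device result. In a real multi-GPU setup the concatenation is an all-gather, and pipeline parallelism instead assigns whole groups of layers to different GPUs.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 1024)     # activations: (batch, hidden)
w = torch.randn(1024, 4096)  # full weight matrix of a linear layer

# Split the weight into two column shards, one per "GPU".
w0, w1 = w.chunk(2, dim=1)   # each shard is (1024, 2048)

y0 = x @ w0                  # computed on "GPU 0"
y1 = x @ w1                  # computed on "GPU 1"
y_parallel = torch.cat([y0, y1], dim=1)  # gather the output slices

# The sharded computation matches the single-device result.
assert torch.allclose(y_parallel, x @ w, atol=1e-4)
```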

FasterTransformer provides the optimized decoder and encoder blocks of the transformer model. It is written in C++/CUDA and has APIs for the TensorFlow, PyTorch, and Triton Backend frameworks. It comes with sample code to demonstrate its main features.
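
As one example of the Triton path, the sketch below sends a generation request to a GPT model deployed behind the FasterTransformer Triton backend, using the standard tritonclient library. Treat the model name ("fastertransformer") and tensor names (input_ids, input_lengths, request_output_len, output_ids) as assumptions taken from the backend's sample configurations; your deployment's config.pbtxt defines the actual names and shapes.

```python
import numpy as np
import tritonclient.http as httpclient

# Assumes Triton is listening on localhost:8000 and the model was
# deployed under the name "fastertransformer" (check your config.pbtxt).
client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[818, 262, 938, 3155]], dtype=np.uint32)  # token IDs
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[32]], dtype=np.uint32)  # tokens to generate

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    tensor = httpclient.InferInput(name, list(data.shape), "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

result = client.infer("fastertransformer", inputs)
print(result.as_numpy("output_ids"))  # generated token IDs
```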

FasterTransformer is open source. It already supports many models, such as GPT-3, GPT-J, GPT-NeoX, BERT, ViT, Swin Transformer, Longformer, T5, XLNet, and BLOOM. Support for new models is constantly being added. It also supports prompt learning techniques. For more information, see the latest support matrix.

As previously stated, FasterTransformer enables a faster inference pipeline with lower latency and higher throughput than other deep learning frameworks. Here are some of the optimization techniques used in FasterTransformer:

- Layer fusion, which combines multiple operations (such as bias addition and activation) into a single kernel to cut kernel-launch overhead and memory traffic
- Caching of attention keys and values during autoregressive decoding, so past tokens are not recomputed at every generation step (see the sketch after this list)
- Autotuning of GEMM kernels to pick the fastest matrix-multiplication algorithm for the given hardware and problem shape
- Lower-precision inference (FP16, BF16, and INT8) that exploits Tensor Cores
- Memory optimizations that reuse activation buffers across decoder layers
- MPI and NCCL communication to support tensor and pipeline parallelism across GPUs and nodes
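
The key/value cache is worth a closer look, because autoregressive generation dominates GPT serving cost. The sketch below shows the idea in plain PyTorch rather than FasterTransformer's fused CUDA kernels: each decoding step computes keys and values only for the newest token and reuses the cached history.

```python
import torch

d = 64                                   # hidden size of the toy example
wk, wv = torch.randn(d, d), torch.randn(d, d)
k_cache, v_cache = [], []

def decode_step(new_hidden: torch.Tensor) -> torch.Tensor:
    """Append this step's key/value to the cache, then attend over history."""
    k_cache.append(new_hidden @ wk)      # K/V computed once per token...
    v_cache.append(new_hidden @ wv)
    k = torch.stack(k_cache)             # ...and reused here, not recomputed
    v = torch.stack(v_cache)
    scores = (new_hidden @ k.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

for _ in range(5):                       # five decoding steps
    out = decode_step(torch.randn(d))
print(out.shape)                         # torch.Size([64])
```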

Introduction to KoGPT

Figure 2. KoGPT tasks within the Korean language context, including chatbots, machine translation, and content generation

KakaoBrain’s KoGPT lexically and contextually understands the Korean language and generates sentences based on user intent. With KoGPT, users can perform various tasks related to the Korean language, such as determining the sentiment of a given sentence, predicting a summary or conclusion, answering a question, or generating the next sentence. It can also handle high-level language tasks such as machine reading, machine translation, writing, and sentiment analysis in various fields.
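
KakaoBrain publishes KoGPT on the Hugging Face Hub, so these tasks can be tried with a standard Transformers workflow. The model ID, revision, and special tokens below follow the published model card at the time of writing; check the card for current values, and note that the 6B-parameter float16 checkpoint needs a GPU with roughly 16 GB of memory.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kakaobrain/kogpt"
REVISION = "KoGPT6B-ryan1.5b-float16"  # float16 checkpoint per the model card

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID, revision=REVISION,
    bos_token="[BOS]", eos_token="[EOS]",
    unk_token="[UNK]", pad_token="[PAD]", mask_token="[MASK]",
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, revision=REVISION,
    torch_dtype="auto", low_cpu_mem_usage=True,
).eval()

# Prompt from the model card: "Through an 'intelligence' that thinks and
# acts like a human, problems humanity has so far failed to solve..."
prompt = "인간처럼 생각하고, 행동하는 '지능'을 통해 인류가 이제까지 풀지 못했던"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64,
                                do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```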

Performance of KoGPT with FasterTransformer

KakaoBrain wanted to serve KoGPT, a GPT-3-based language model implemented with Hugging Face. Speed was not an issue during training, where overall throughput mattered more than latency. When it came to serving, however, inference speed became a major issue: the slower the inference, the more servers had to be dedicated to the service, and the resulting operational costs grew rapidly.

The team at KakaoBrain tried various methods to optimize the inference speed of GPT-3 but could not achieve much improvement. Only after adopting NVIDIA FasterTransformer did the team see a significant gain. Inference speed varies with the number of input and output tokens, but on average the team sped up inference by up to 400% over the existing implementation on a single NVIDIA V100 GPU.

FasterTransformer also supports tensor and pipeline parallelism. Using four V100 GPUs, the team achieved an inference speedup of over 1100% compared to the baseline.
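
The exact numbers depend on token counts and hardware, but speedups like these can be verified with a simple throughput probe. The sketch below is an illustrative harness (not KakaoBrain's benchmark code) that measures generated tokens per second for any causal LM loaded through Hugging Face Transformers; running it against both the baseline and the FasterTransformer-backed model and dividing the two rates gives the speedup reported in Figure 3.

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompt: str,
                      new_tokens: int = 128, warmup_runs: int = 1) -> float:
    """Rough generation-throughput probe for a Hugging Face causal LM."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    for _ in range(warmup_runs):           # warm up kernels and caches
        model.generate(**inputs, max_new_tokens=8, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()           # don't time queued GPU work
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    generated = out.shape[1] - inputs["input_ids"].shape[1]
    return generated / elapsed
```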

Figure 3. KoGPT on FasterTransformer shows significant performance gains compared to PyTorch: about 4x faster on a single GPU and 11x faster on multiple GPUs

Summary

NVIDIA FasterTransformer made it easy for the KoGPT team to overcome the technical challenges they faced before launching the model as a service. The KakaoBrain ML Optimization team improved total cost of ownership (TCO) by more than 15% by serving more requests on the same hardware.

For more information about KakaoBrain’s Korean LLM, KoGPT, and Korean AI chatbot KoChat GPT, see the Kakao website.

Source: NVIDIA
