Unlocking Multi-GPU Model Training with Dask XGBoost

A diagram representing multi-GPU training.

As data scientists, we often face the challenging task of training large models on huge datasets. One commonly used tool, XGBoost, is a robust and efficient…

As data scientists, we often face the challenging task of training large models on huge datasets. One commonly used tool, XGBoost, is a robust and efficient gradient-boosting framework that’s been widely adopted due to its speed and performance for large tabular data. 

Using multiple GPUs should theoretically provide a significant boost in computational power, resulting in faster model training. Yet, many users have found it challenging when attempting to leverage this power through Dask XGBoost. Dask is a flexible open-source Python library for parallel computing and XGBoost provides Dask APIs to train CPU or GPU Dask DataFrames.

A common hurdle of training Dask XGBoost is handling out of memory (OOM) errors at different stages, including

  • Loading the training data
  • Converting the DataFrame into XGBoost’s DMatrix format
  • During the actual model training

Addressing these memory issues can be challenging, but very rewarding because the potential benefits of multi-GPU training are enticing.

Top takeaways

This post explores how you can optimize Dask XGBoost on multiple GPUs and manage memory errors. Training XGBoost on large datasets presents a variety of challenges. I use the Otto Group Product Classification Challenge dataset to demonstrate the OOM problem and how to fix it. The dataset has 180 million rows and 152 columns, totaling 110 GB when loaded into memory.

The key issues we tackle include: 

  • Installation using the latest version of RAPIDS and the correct version of XGBoost.
  • Setting environment variables.
  • Dealing with OOM errors.
  • Utilizing UCX-py for more speedup.

Be sure to follow along with the accompanying Notebooks for each section.

Prerequisites

An initial step in leveraging the power of RAPIDS for multi-GPU training is the correct installation of RAPIDS libraries. It’s critical to note that there are several ways to install these libraries—pip, conda, docker, and building from source, each compatible with Linux and Windows Subsystem for Linux. 

Each method has unique considerations. For this guide, I recommend using Mamba, while adhering to the conda install instructions. Mamba provides similar functionalities as conda but is much faster, especially for dependency resolution. Specifically, I opted for a fresh installation of mamba. 

Install the latest RAPIDS version

As a best practice, always install the latest RAPIDS libraries available to use the latest features. You can find up-to-date install instructions in the RAPIDS Installation Guide. 

This post uses version 23.04, which can be installed with the following command:

mamba create -n rapids-23.04 -c rapidsai -c conda-forge -c nvidia  

    rapids=23.04 python=3.10 cudatoolkit=11.8

This instruction installs all the libraries required including Dask, Dask-cuDF, XGBoost, and more. In particular, you’ll want to check the XGBoost library installed using the command:

mamba list xgboost

The output is listed in Table 1:

NameVersionBuildChannelXGBoost1.7.1dev.rapidsai23.04cuda_11_py310_3rapidsai-nightlyTabel 1. Install the correct XGBoost whose channel should be rapidsai or rapidsai-nightly

Avoid manual updates for XGBoost

Some users might notice that the version of XGBoost is not the latest, which is 1.7.5. Manually updating or installing XGBoost using pip or conda-forge is‌ problematic when training XGBoost together with UCX. 

The error message will read something like the following:
Exception: “XGBoostError(‘[14:14:27] /opt/conda/conda-bld/work/rabit/include/rabit/internal/utils.h:86: Allreduce failed’)”

Instead, use the XGBoost installed from RAPIDS. A quick way to verify the correctness of the XGBoost version is mamba list xgboost and check the “channel” of the xgboost, which should be “rapidsai” or “rapidsai-nightly”.  

XGBoost in the rapidsai channel is built with the RMM plug-in enabled and delivers the best performance regarding multi-GPU training.

Multi-GPU training walkthrough 

First, I’ll walk through a multi-GPU training notebook for the Otto dataset and cover the steps to make it work. Later on, we will talk about some advanced optimizations including UCX and spilling. 

You can also find the XGB-186-CLICKS-DASK Notebook on GitHub. Alternatively, we provide a python script with full command line configurability.

The main libraries we are going to use are xgboost, dask, dask_cuda, and dask-cudf. 

import os

import dask

import dask_cudf

import xgboost as xgb

from dask.distributed import Client

from dask_cuda import LocalCUDACluster

Environment set up

First, let’s set our environment variables to make sure our GPUs are visible. This example uses eight GPUs with 32 GB of memory on each GPU, which is the minimum requirement to run this notebook without OOM complications. In Section Enable memory spilling below we will discuss techniques to lower this requirement to 4 GPUs.

GPUs = ','.join([str(i) for i in range(0,8)])
os.environ['CUDA_VISIBLE_DEVICES'] = GPUs

Next, define a helper function to create a local GPU cluster for a mutli-GPU single node.

def get_cluster():

    cluster = LocalCUDACluster()

    client = Client(cluster)

    return client

Then, create a Dask client for your computations.

client = get_cluster()

Loading data

Now, let’s load the Otto dataset. Use dask_cudf read_parquet function, which uses multiple GPUs to read the parquet files into a dask_cudf.DataFrame.

users = dask_cudf.read_parquet('/raid/otto/Otto-Comp/pqs/train_v152_*.pq').persist()

The dataset consists of 152 columns that represent engineered features, providing information about the frequency with which specific product pairs are viewed or purchased together. The objective is to predict which product the user will click next based on their browsing history. The details of this dataset can be found in this writeup.

Even at this early stage, out of memory errors can occur. This issue often arises due to excessively large row groups in ‌parquet files. To resolve this, we recommend rewriting the parquet files with smaller row groups. For a more in-depth explanation, refer to the Parquet Large Row Group Demo Notebook.

After loading the data, we can check its shape and memory usage.

users.shape[0].compute()

users.memory_usage().sum().compute()/2**30

The ‘clicks’ column is our target, which means if the recommended item was clicked by the user. We ignore the ID columns and use the rest columns as features.

FEATURES = users.columns[2:]

TARS = ['clicks']

FEATURES = [f for f in FEATURES if f not in TARS]

Next, we create a DaskQuantileDMatrix which is the input data format for training xgboost models. DaskQuantileDMatrix is a drop-in replacement for the DaskDMatrix when the histogram tree method is used. It helps reduce overall memory usage. 

This step is critical to avoid OOM errors. If we use the DaskDMatrix OOM occurs even with 16 GPUs. In contrast, DaskQuantileDMatrix enables training xgboot with eight GPUs or less without OOM errors.

dtrain = xgb.dask.DaskQuantileDMatrix(client, users[FEATURES], users['clicks'])

XGBoost model training

We then set our XGBoost model parameters and start the training process. Given the target column ‘clicks’ is binary, we use the binary classification objective. 

xgb_parms = { 

    'max_depth':4, 

    'learning_rate':0.1, 

    'subsample':0.7,

    'colsample_bytree':0.5, 

    'eval_metric':'map',

    'objective':'binary:logistic',

    'scale_pos_weight':8,

    'tree_method':'gpu_hist',

    'random_state':42

}

Now, you’re ready to train the XGBoost model using all eight GPUs.  

Output:

[99] train-map:0.20168

CPU times: user 7.45 s, sys: 1.93 s, total: 9.38 s

Wall time: 1min 10s

That’s it! You’re done with training the XGBoost model using multiple GPUs.

Enable memory spilling

In the previous XGB-186-CLICKS-DASK Notebook, training the XGBoost model on the Otto dataset required a minimum of eight GPUs. Given that this dataset occupies 110GB in memory, and each V100 GPU offers 32GB, the data-to-GPU-memory ratio amounts to a mere 43% (calculated as 110/(32*8)). 

Optimally, we’d halve this by using just four GPUs. Yet, a straightforward reduction of GPUs in our previous setup invariably leads to OOM errors. This issue arises from the creation of temporary variables needed to generate the DaskQuantileDMatrix from the Dask cuDF dataframe and in other steps of training XGBoost. These variables themselves consume a substantial share of the GPU memory. 

Optimize the same GPU resources to train larger datasets

In the XGB-186-CLICKS-DASK-SPILL Notebook, I introduce minor tweaks to the previous setup. By enabling spilling, you can now train on the same dataset using just four GPUs. This technique allows you to train much larger data with the same GPU resources.

Spilling is the technique that moves data automatically when an operation that would otherwise succeed runs out of memory due to other dataframes or series taking up needed space in GPU memory. It enables out-of-core computations on datasets that don’t fit into memory. RAPIDS cuDF and dask-cudf now support spilling from GPU to CPU memory 

Enabling spilling is surprisingly easy, where we just need to reconfigure the cluster with two new parameters, device_memory_limit and jit_unspill:

def get_cluster():

    ip = get_ip()

    cluster = LocalCUDACluster(ip=ip, 

                               device_memory_limit='10GB',

                               jit_unspill=True)

    client = Client(cluster)

    return client

device_memory_limit='10GB’ sets a limit on the amount of GPU memory that can be used by each GPU before spilling is triggered. Our configuration intentionally assigns a device_memory_limit of 10GB, substantially less than the total 32GB of the GPU. This is a deliberate strategy designed to preempt OOM errors during XGBoost training. 

It’s also important to understand that memory usage by XGBoost isn’t managed directly by Dask-CUDA or Dask-cuDF. Therefore, to prevent memory overflow, Dask-CUDA and Dask-cuDF need to initiate the spilling process before the memory limit is reached by XGBoost operations.

Jit_unspill enables Just-In-Time un-spilling, which means that the cluster will automatically spill data from GPU memory to main memory when GPU memory is running low, and unspill it back just in time for a computation.

And that’s it! The rest of the notebook is identical to the previous notebook. Now it can train with just four GPUs, saving 50% of computing resources.

Refer to the XGB-186-CLICKS-DASK-SPILL Notebook for details.

Use Unified Communication X (UCX) for optimal data transfer

UCX-py is a high-performance communication protocol that provides optimized data transfer capabilities, which is particularly useful for GPU-to-GPU communication.

To use UCX effectively, we need to set another environment variable RAPIDS_NO_INITIALIZE

os.environ["RAPIDS_NO_INITIALIZE"] = "1"

It stops cuDF from running various diagnostics on import which requires the creation of an NVIDIA CUDA context. When running distributed and using UCX, we have to bring up the networking stack before a CUDA context is created (for various reasons). By setting that environment variable, any child processes that import cuDF do not create a CUDA context before UCX has a chance to do so. 

Reconfigure the cluster:

def get_cluster():

    ip = get_ip()

    cluster = LocalCUDACluster(ip=ip, 

                               device_memory_limit='10GB',

                               jit_unspill=True,

                               protocol="ucx", 

                                 rmm_pool_size="29GB"

)

    client = Client(cluster)

    return client

The protocol=’ucx’ parameter specifies UCX to be the communication protocol used for transferring data between the workers in the cluster.

Use the prmm_pool_size=’29GB’ parameter to set the size of the RAPIDS Memory Manager (RMM) pool for each worker. RMM allows for efficient use of GPU memory. In this case, the pool size is set to 29GB which is less than the total GPU memory size of 32GB. This adjustment is crucial as it accounts for the fact that XGBoost creates certain intermediate variables that exist outside the control of the RMM pool.

By simply enabling UCX, we experienced a substantial acceleration in our training times—a significant speed boost of 20% with spilling, and an impressive 40.7% speedup when spilling was not needed. Refer to the XGB-186-CLICKS-DASK-UCX-SPILL Notebook for details.

Configure local_directory

There are times when warning messages emerge, such as, “UserWarning: Creating scratch directories is taking a surprisingly long time.” This is a signal indicating that ‌disk performance is becoming a bottleneck. 

To circumvent this issue, we could set local_directory of dask-cuda, which specifies the path on the local machine to store temporary files. These temporary files are used during Dask’s spill-to-disk operations. 

A recommended practice is to set the local_directory to a location on a fast storage device. For instance, we could set local_directory to /raid/dask_dir if it is on a high-speed local SSD. Making this simple change can significantly reduce the time it takes for scratch directory operations, optimizing your overall workflow. 

The final cluster configuration is as follows:

def get_cluster():

    ip = get_ip()

    cluster = LocalCUDACluster(ip=ip, 

                               local_directory=’/raid/dask_dir’               

                               device_memory_limit='10GB',

                               jit_unspill=True,

                               protocol="ucx", 

                                 rmm_pool_size="29GB"

)

    client = Client(cluster)

    return client

Results

As shown in Table 2, the two main optimization techniques are UCX and spilling. We managed to train XGBoost with just four GPUs and 128GB of memory. We will also show the performance scales nicely to more GPUs.

Spilling offSpilling onUCX off135s / 8GPUs / 256 GB270s / 4GPUs / 128 GBUCX on80s / 8GPUs /  256 GB 217s / 4GPUs / 128 GBTable 2. Overview of four combinations of optimizations

In each cell, the numbers represent end-to-end execution time, the minimum number of GPUs required, and the total GPU memory available. All four demos accomplish the same task of loading and training 110 GB of Otto data.

Summary

In conclusion, leveraging Dask and XGBoost with multiple GPUs can be an exciting adventure, despite the occasional bumps like out of memory errors.

You can mitigate these memory challenges and tap into the potential of multi-GPU model training by:

  • Carefully configuring parameters such as row group size in the input parquet files
  • Ensuring the correct installation of RAPIDS and XGBoost
  • Utilizing Dask Quantile DMatrix
  • Enabling spilling

Furthermore, by applying advanced features such as UCX-Py, you can significantly speed up training times.

imgSign up for the latest Data Science news. Get the latest announcements, notebooks, hands-on tutorials, events, and more in your inbox once a month from NVIDIA.

Source:: NVIDIA