
The worlds of AI and HPC have an insatiable appetite for more and more performance. With the rise of GPUs being used to run a lot of these AI frameworks, it’s only fitting that we put the fastest system in the world to the test. One of NVIDIA’s DGX-2 servers arrived onsite recently, and our engineers have integrated this with our internal vScaler lab facility.

The NVIDIA DGX-2 Server

The DGX-2 server builds on the success of the DGX-1 server, increasing and improving pretty much everything to create a 2 petaflop (tensor ops) monster of a system. Some of the hardware highlights include:

  • 16x V100 32GB GPUs (that’s half a TB of GPU HBM2 memory space when used with
    CUDA unified memory and cudaMallocManaged())
  • 12x NVSwitch switches providing a non-blocking GPU fabric with 2.4TB/s bisection bandwidth
  • 800Gb/s of network trunk to get data in and out (seems overkill for just ssh!)
  • 30TB of local NVMe SSD to keep those GPUs busy
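
As a quick sanity check on the headline numbers above, here is a minimal Python sketch of the arithmetic. The GPU count and per-GPU memory come from the list; treating the network figure as 800 gigabits per second is our assumption, and the function names are purely illustrative.

```python
# Back-of-the-envelope totals for the hardware highlights listed above.
# Function names are illustrative; the 800 Gb/s reading of the network
# figure is an assumption, not a measured value.

def total_hbm2_gb(num_gpus: int, gb_per_gpu: int) -> int:
    """Aggregate GPU memory across all GPUs, in GB."""
    return num_gpus * gb_per_gpu


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate from gigabits/s to gigabytes/s."""
    return gbit_per_s / 8


if __name__ == "__main__":
    memory_gb = total_hbm2_gb(num_gpus=16, gb_per_gpu=32)
    network_gbyte_s = gbit_to_gbyte(800)
    print(f"Aggregate HBM2: {memory_gb} GB")
    print(f"Network trunk: {network_gbyte_s:.0f} GB/s")
    # Rough time to stream enough data to fill all GPU memory once.
    print(f"Time to fill GPU memory over the network: {memory_gb / network_gbyte_s:.2f} s")
```

With these inputs the aggregate memory works out to 512 GB, which is where the "half a TB" figure comes from.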

Another tip of the hat needs to go to the NVIDIA GPU Cloud, as the number of containers, applications and frameworks available on this platform is growing daily. Optimised containers across Deep Learning, AI and HPC are readily available, and we used the TensorFlow container from this platform for the benchmarking exercise:

$ docker pull

Gone are the days of manually building libraries, matching python versions, source-code hacking and praying!

NVIDIA DGX-2 Server Components

Integration with vScaler

vScaler integration was seamless: we had a preconfigured image that we’ve been using for our DeepOps integration, and we flashed the system with that (bare-metal provisioning, not virtualised). This provided us with all the tools needed to access the NVIDIA GPU Cloud container repository along with Kubernetes and other optimisation options, all based on Ubuntu Bionic 18.04 LTS.

Benchmark Setup

All benchmarks were run using nvidia-docker, making use of the latest TensorFlow container provided by NVIDIA GPU Cloud (nvidia/tensorflow:18.10-py3), with the synthetic ImageNet dataset (provided as part of tf_cnn_benchmarks).

The benchmark script used was the one from tf_cnn_benchmarks, and we performed a sweep of batch sizes across the tests. All tests were run a number of times and the reported numbers are averages.

TensorFlow Benchmarking for ResNet Models

To assess the performance of the system we employed the commonly used ResNet model, which serves as a baseline for assessing training and inference performance. ResNet is shorthand for Residual Network and, as the name suggests, it relies on residual learning, which tries to solve the challenges of training deep neural networks. Such challenges include networks becoming harder to train as they get deeper, as well as accuracy saturation and degradation. We selected two common models, ResNet-50 and ResNet-152 (where ResNet-50 is a 50-layer Residual Network, and 152 is… well, you’ve guessed it!).
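
To make the residual idea concrete, here is a minimal pure-Python sketch (not the actual ResNet layers, which use convolutions and batch normalisation). Each block computes F(x) + x, so a block whose learned function F outputs zero simply passes its input through. That is the intuition for why very deep stacks remain trainable. All names below are illustrative.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def residual_block(x: Sequence[float], residual_fn: Callable[[Sequence[float]], Vector]) -> Vector:
    """Return F(x) + x element-wise, where F is the block's learned function."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("residual_fn must preserve the input length")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_residual(x: Sequence[float]) -> Vector:
    """A block that has learned nothing yet: F(x) = 0."""
    return [0.0 for _ in x]


def stack(x: Sequence[float], blocks: Sequence[Callable[[Sequence[float]], Vector]]) -> Vector:
    """Apply a sequence of residual blocks one after another."""
    out = list(x)
    for fn in blocks:
        out = residual_block(out, fn)
    return out


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    # Fifty "do nothing" blocks leave the input unchanged (identity mapping).
    print(stack(x, [zero_residual] * 50))
```

Without the skip connection, fifty blocks outputting zero would wipe out the signal entirely; with it, the extra depth costs nothing until the blocks learn something useful.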

ResNet was introduced in 2015 and was the winner of the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2015 in image classification, detection, and localisation. There are of course many other Convolutional Neural Network (CNN) architectures we could have chosen, and in time we hope to evaluate these also. See Table 1 for a brief history of the ILSVRC competition CNN architecture models.

Year  Model      Developed by                                       Top-5 error  Parameters
1998  LeNet      Yann LeCun et al.                                  -            60 thousand
2012  AlexNet    Alex Krizhevsky, Geoffrey Hinton, Ilya Sutskever   13.3%        60 million
2013  ZFNet      Matthew Zeiler, Rob Fergus                         14.8%        -
2014  GoogLeNet  Google                                             6.67%        4 million
2014  VGGNet     Simonyan, Zisserman                                7.3%         138 million
2015  ResNet     Kaiming He                                         3.6%         -

Table 1 : ILSVRC competition CNN architecture models. Ref:

Each model was run using various batch sizes to ensure that each GPU was fully utilised, demanding the highest level of performance from the system. Each combination of batch size and GPU count was tested 3 times over 20 epochs and the average result recorded. Results below show the images processed per second during the network training phase.
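
The averaging step described above can be sketched in a few lines of Python. This is a minimal illustration rather than our actual tooling: the record layout (model, batch size, GPU count, images per second) and the function name are assumptions, and the numbers in the usage example are placeholders, not measured results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# (model, batch_size, num_gpus, images_per_sec): a hypothetical record layout.
Record = Tuple[str, int, int, float]
Key = Tuple[str, int, int]


def average_runs(records: Iterable[Record]) -> Dict[Key, float]:
    """Group repeated runs by (model, batch size, GPU count) and average them."""
    grouped = defaultdict(list)
    for model, batch_size, num_gpus, images_per_sec in records:
        grouped[(model, batch_size, num_gpus)].append(images_per_sec)
    return {key: mean(values) for key, values in grouped.items()}


if __name__ == "__main__":
    # Placeholder values purely to show the shape of the data.
    runs = [
        ("resnet50", 256, 1, 100.0),
        ("resnet50", 256, 1, 110.0),
        ("resnet50", 256, 1, 120.0),
    ]
    for key, avg in average_runs(runs).items():
        print(key, f"{avg:.1f} images/sec")
```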

Training Command:
python --data_format=NCHW --batch_size=${BATCH_SIZE} \
    --model=${MODEL} --optimizer=momentum --variable_update=replicated --nodistortions \
    --gradient_repacking=8 --num_gpus=${NUM_GPUS} --num_epochs=10 --weight_decay=1e-4 \
    --data_dir=/workspace/data --use_fp16

Inference Command:
python --forward_only=True --batch_size=${BATCH_SIZE} --model=${MODEL} \
    --num_epochs=10 --optimizer=momentum --distortions=True --display_every 10 \
    --num_gpus=${NUM_GPUS} --data_dir=./test_data/fake_tf_record_data/ --data_name=imagenet
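
For anyone wanting to reproduce a sweep like this, the following Python sketch generates one training command per combination of model, batch size and GPU count. The flags are taken from the training command above; the script name `tf_cnn_benchmarks.py`, the model names and the sweep values are assumptions for illustration, so adjust them to your own setup. Nothing is executed here, the commands are only printed.

```python
import itertools
import shlex
from typing import Iterable, Iterator, List

# Assumed script name; the command above does not show it.
SCRIPT = "tf_cnn_benchmarks.py"


def training_command(model: str, batch_size: int, num_gpus: int) -> List[str]:
    """Build the argument list for one training run."""
    return [
        "python", SCRIPT,
        "--data_format=NCHW",
        f"--batch_size={batch_size}",
        f"--model={model}",
        "--optimizer=momentum",
        "--variable_update=replicated",
        "--nodistortions",
        "--gradient_repacking=8",
        f"--num_gpus={num_gpus}",
        "--num_epochs=10",
        "--weight_decay=1e-4",
        "--data_dir=/workspace/data",
        "--use_fp16",
    ]


def sweep(models: Iterable[str], batch_sizes: Iterable[int],
          gpu_counts: Iterable[int]) -> Iterator[List[str]]:
    """Yield one command per (model, batch size, GPU count) combination."""
    for model, batch_size, num_gpus in itertools.product(models, batch_sizes, gpu_counts):
        yield training_command(model, batch_size, num_gpus)


if __name__ == "__main__":
    # Hypothetical sweep values, not the exact ones used in our tests.
    for cmd in sweep(["resnet50", "resnet152"], [64, 128, 256], [1, 2, 4, 8, 16]):
        print(shlex.join(cmd))
```

Building each command as a list rather than a single string keeps it safe to hand to `subprocess.run` later without worrying about shell quoting.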

Comparing the average images per second of each model for a fixed batch size and varying GPU count shows a near-linear performance increase for each GPU added. For example, when running ResNet-50 with a batch size of 256, going from 1 GPU to 16 GPUs results in a scaling factor of 13.9 (which represents a scaling efficiency of roughly 87%). We’re confident that with some tweaking we can improve this further.
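
Scaling efficiency here is simply the measured speedup divided by the number of GPUs. A minimal sketch of that calculation, using the 13.9 scaling factor quoted above (the function names are illustrative):

```python
def speedup(images_per_sec_n_gpus: float, images_per_sec_1_gpu: float) -> float:
    """Throughput on N GPUs relative to throughput on a single GPU."""
    return images_per_sec_n_gpus / images_per_sec_1_gpu


def scaling_efficiency(scaling_factor: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return scaling_factor / num_gpus


if __name__ == "__main__":
    efficiency = scaling_efficiency(13.9, 16)
    print(f"Scaling efficiency: {efficiency:.1%}")
```

An efficiency of 1.0 would mean each added GPU contributes a full GPU's worth of throughput; anything lost below that goes to communication and synchronisation overhead.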

ResNet 50 Training
ResNet 152 Training
ResNet 50 Inference
ResNet 152 Inference

During the tests we monitored the system power draw through the onboard sensors and captured data points using ipmitool. Below is a chart of the power draw over time, as the tests iterated through the models, batch sizes and number of GPUs.

Power draw during benchmarking
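
For reference, here is a minimal Python sketch of how power samples could be collected with ipmitool and summarised. It assumes the BMC supports `ipmitool dcmi power reading` and that the output contains a line of the form `Instantaneous power reading: <N> Watts`; both are assumptions about the environment rather than a description of our exact capture method, and the function names are illustrative.

```python
import re
import subprocess
from statistics import mean
from typing import List, Optional, Tuple

# Assumed output format of `ipmitool dcmi power reading`.
POWER_RE = re.compile(r"Instantaneous power reading:\s+(\d+)\s+Watts")


def parse_power_watts(output: str) -> Optional[int]:
    """Extract the instantaneous power reading in watts, or None if absent."""
    match = POWER_RE.search(output)
    return int(match.group(1)) if match else None


def read_power_watts() -> Optional[int]:
    """Take one sample from the BMC (needs ipmitool and suitable privileges)."""
    result = subprocess.run(
        ["ipmitool", "dcmi", "power", "reading"],
        capture_output=True, text=True, check=True,
    )
    return parse_power_watts(result.stdout)


def summarize(samples: List[int]) -> Tuple[int, float, int]:
    """Return (min, mean, max) over the collected samples."""
    return min(samples), mean(samples), max(samples)


if __name__ == "__main__":
    # Parse a sample line rather than calling the BMC, so this runs anywhere.
    sample = "    Instantaneous power reading:                  3200 Watts\n"
    print(parse_power_watts(sample))
```

Calling `read_power_watts()` in a loop with a short sleep between samples yields a time series that can be plotted against the benchmark schedule.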

Thanks to the tech team in the lab for their work on this – stay tuned for more results on the NVIDIA DGX-2 server as we look at other applications and workloads. For more information on the NVIDIA™ DGX-2 Server visit

Have you something interesting to run on the DGX-2 Server?

