The worlds of AI and HPC have an insatiable appetite for more and more performance. With GPUs now powering so many AI frameworks, it's only fitting that we put one of the fastest systems in the world to the test. An NVIDIA DGX-2 server arrived onsite recently, and our engineers have integrated it with our internal vScaler lab facility.
The NVIDIA DGX-2 Server
The DGX-2 server builds on the success of the DGX-1 server and increases or improves pretty much everything, creating a 2 petaFLOPS (tensor ops) monster of a system. Some of the hardware highlights include:
- 16x V100 32GB GPUs (that's half a terabyte of GPU HBM2 memory, addressable as one space via CUDA unified memory and cudaMallocManaged())
- 12x NVSwitch switches providing a non-blocking GPU fabric with 2.4TB/s bisection bandwidth (see the quick topology check after this list)
- 800Gb/s of network trunk to get data in and out (seems overkill for just ssh!)
- 30TB of local NVMe SSD to keep those GPUs busy
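As a quick illustration (our own spot-check, not part of the benchmark runs), the NVSwitch fabric can be inspected straight from the host with nvidia-smi; on the DGX-2 every GPU pair should show up as NVLink-connected:

# List the 16 GPUs, then print the interconnect matrix (NV# entries indicate NVLink/NVSwitch paths)
nvidia-smi -L
nvidia-smi topo -m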
Another tip of the hat needs to go to the NVIDIA GPU Cloud (https://ngc.nvidia.com) as the number of containers/applications/frameworks that are available on this platform is growing daily. Optimised containers across Deep Learning, AI and HPC are readily available and we used the Tensorflow container from this platform for the benchmarking exercise:
$ docker pull nvcr.io/nvidia/tensorflow:18.10-py3
Gone are the days of manually building libraries, matching python versions, source-code hacking and praying!
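For a flavour of how that container might then be launched for interactive use, here is a minimal sketch (the mounted data path and shared-memory settings are illustrative assumptions, not the exact flags from our runs):

# Start the NGC TensorFlow image with the NVIDIA runtime, generous shared memory
# for the input pipeline, and a local NVMe dataset directory mounted into the container
nvidia-docker run -it --rm \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /raid/data:/workspace/data \
    nvcr.io/nvidia/tensorflow:18.10-py3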

Integration with vScaler
vScaler integration was seamless – we had a preconfigured image that we've been using for our DeepOps integration (https://github.com/NVIDIA/deepops) and we provisioned the system with it on bare metal (not virtualised). This provided all the tools needed to access the NVIDIA GPU Cloud container registry, along with Kubernetes and other optimisation options, all based on Ubuntu 18.04 LTS (Bionic).
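As an illustration, a few quick post-provisioning checks along these lines (our own sanity-check suggestion, not a prescribed DeepOps step) confirm the node can see its GPUs and reach the NGC registry:

nvidia-smi -L | wc -l    # expect 16 GPUs listed
docker login nvcr.io     # username '$oauthtoken', password is your NGC API key
kubectl get nodes        # if the DeepOps-deployed Kubernetes components are in use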
Benchmark Setup
All benchmarks were run using nvidia-docker, making use of the latest TensorFlow container provided by NVIDIA GPU Cloud (nvidia/tensorflow:18.10-py3), with the synthetic ImageNet dataset provided as part of tf_cnn_benchmarks.
The benchmark script used was obtained from https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks and we performed a sweep of batch sizes across the tests. All tests were run a number of times and the numbers reported were averaged.
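For illustration, a simple wrapper along these lines would drive such a sweep (the batch sizes, GPU counts and log naming here are our assumptions; the exact flag sets we used are listed under the training and inference commands below):

# Sweep models, batch sizes and GPU counts, logging each run for later averaging
cd benchmarks/scripts/tf_cnn_benchmarks
for MODEL in resnet50 resnet152; do
  for BATCH_SIZE in 64 128 256; do
    for NUM_GPUS in 1 2 4 8 16; do
      python tf_cnn_benchmarks.py --model=${MODEL} --batch_size=${BATCH_SIZE} \
          --num_gpus=${NUM_GPUS} --use_fp16 \
          > run_${MODEL}_bs${BATCH_SIZE}_gpu${NUM_GPUS}.log 2>&1
    done
  done
done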
TensorFlow Benchmarking for ResNet Models
To assess the performance of the system we employed the commonly used ResNet model, a standard baseline for assessing training and inference performance. ResNet is shorthand for Residual Network and, as the name suggests, it relies on residual learning, which tries to solve the challenges of training deep neural networks. Such challenges include increased difficulty to train as networks go deeper, as well as accuracy saturation and degradation. We selected two common variants: ResNet-50 and ResNet-152 (where ResNet-50 is a 50-layer residual network, and ResNet-152 is... well, you've guessed it!).
ResNet was introduced in 2015 and won ILSVRC 2015 (the ImageNet Large Scale Visual Recognition Challenge) in image classification, detection, and localisation. There are of course many other Convolutional Neural Network (CNN) architectures we could have chosen, and in time we hope to evaluate these too. See Table 1 for a brief history of CNN architectures from the ILSVRC competition.
YEAR | CNN | DEVELOPED BY | ERROR RATE | #PARAMETERS
1998 | LeNet | Yann LeCun et al. | - | 60 thousand
2012 | AlexNet | Alex Krizhevsky, Geoffrey Hinton, Ilya Sutskever | 13.3% | 60 million
2013 | ZFNet | Matthew Zeiler, Rob Fergus | 14.8% | -
2014 | GoogLeNet | Google | 6.67% | 4 million
2014 | VGGNet | Simonyan, Zisserman | 7.3% | 138 million
2015 | ResNet | Kaiming He | 3.6% | -
Table 1: ILSVRC competition CNN architecture models. Ref: https://medium.com/@RaghavPrabhu/cnn-architectures-lenet-alexnet-vgg-googlenet-and-resnet-7c81c017b848
Each model was run using various batch sizes to ensure that each GPU was fully utilised, demanding the highest level of performance from the system. Each combination of batch size and GPU count was tested 3 times over 20 epochs and the average result recorded. Results below show the images processed per second during the network training phase.
Training Command: python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=${BATCH_SIZE} --model=${MODEL} --optimizer=momentum --variable_update=replicated --nodistortions --gradient_repacking=8 --num_gpus=${NUM_GPUS} --num_epochs=10 --weight_decay=1e-4 --data_dir=/workspace/data --use_fp16 --train_dir=${CKPT_DIR}
Inference Command: python tf_cnn_benchmarks.py --forward_only=True --batch_size=${BATCH_SIZE} --model=${MODEL} --num_epochs=10 --optimizer=momentum --distortions=True --display_every=10 --num_gpus=${NUM_GPUS} --data_dir=./test_data/fake_tf_record_data/ --data_name=imagenet
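tf_cnn_benchmarks prints a "total images/sec:" line at the end of each run, so averaging the repeated runs of a given configuration can be done with a small one-liner like this (the log file naming follows the illustrative sweep above and is our assumption):

# Average the reported throughput across repeated runs of one configuration
grep -h "total images/sec" run_resnet50_bs256_gpu16*.log \
  | awk '{ sum += $3 } END { printf "avg images/sec: %.1f\n", sum / NR }'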
Comparing the average images per second of each model for a fixed batch size and varying GPU count shows a near-linear performance increase for each GPU added. For example, when running ResNet-50 with a batch size of 256, going from 1 GPU to 16 GPUs results in a scaling factor of 13.9 out of an ideal 16, which represents roughly 86% scaling efficiency. We're confident that with some tweaking we can improve this further.
During the tests we monitored the system power draw through the onboard sensors and captured data points using ipmitool. Below is a chart of the power draw over time, as the tests iterated through the models, batch sizes and number of GPUs.
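For reference, the power samples can be gathered with a simple polling loop along these lines (the 5-second interval and use of the DCMI power reading are our assumptions; the sensors actually available depend on the BMC):

# Append a timestamped instantaneous power reading every 5 seconds
while true; do
  watts=$(ipmitool dcmi power reading | awk '/Instantaneous/ {print $4}')
  echo "$(date +%s),${watts}" >> dgx2_power.csv
  sleep 5
done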
Thanks to the tech team in the lab for their work on this – stay tuned for more results on the NVIDIA DGX-2 server as we look at other applications and workloads. For more information on the NVIDIA™ DGX-2 Server visit www.nvidia.com/en-us/datacenter/dgx-2.
Have you something interesting to run on the DGX-2 Server?
Request a callback: