Nvidia DGX-1

The Stanford Computer Vision Lab has added an Nvidia DGX-1 to its compute cluster. It is currently accessible only from the SAIL network. Please file a help request at http://support.cs.stanford.edu if you have any questions about using the machine.


Specification

Hostname visionlab-dgx1.stanford.edu
CPU 2x Intel Xeon E5-2698 v4 (20 cores each, 2.2 GHz)
RAM 512GB
GPU 8x Tesla P100
Networking 10GbE
Storage 4x 2TB SSD RAID0, NFS-shared storage

Nvidia-Docker

Nvidia recommends using Nvidia-Docker with its prebuilt containers for optimized performance and convenience. The containers cover CUDA and popular frameworks such as Caffe, DIGITS, TensorFlow and Torch. Here is how to get started.

# Load a framework image from its tarball
docker load --input /raid/containers/<framework>.tar

# Make sure the image was loaded correctly
docker images

# Check that the GPUs are visible from inside a container
nvidia-docker run --rm compute.nvidia.com/nvidia/cuda nvidia-smi

# Launch an interactive framework container
nvidia-docker run --rm -ti compute.nvidia.com/nvidia/<framework>
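
Because the machine is shared, it can help to limit a container to a subset of the eight GPUs and to mount a data directory into it. The sketch below only prints the command rather than running it. It assumes the installed nvidia-docker is a 1.x release that honors the NV_GPU environment variable. The helper name build_run_cmd, the /raid/datasets path and the /data mount point are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print an nvidia-docker command restricted to selected GPUs.
# build_run_cmd is a hypothetical helper, not part of nvidia-docker.
# Assumption: nvidia-docker 1.x, which reads GPU indices from NV_GPU.
build_run_cmd() {
    local gpus="$1"       # comma-separated GPU indices, e.g. 0,1
    local framework="$2"  # image name under compute.nvidia.com/nvidia/
    local datadir="$3"    # host directory to mount at /data (hypothetical layout)
    printf 'NV_GPU=%s nvidia-docker run --rm -ti -v %s:/data compute.nvidia.com/nvidia/%s\n' \
        "$gpus" "$datadir" "$framework"
}

# Print the command for a Caffe container on GPUs 0 and 1.
build_run_cmd 0,1 caffe /raid/datasets
```

Review the printed line and paste it into the shell once it looks right. Printing first avoids accidentally starting a container on GPUs that someone else is using.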

Example

csid@visionlab-dgx1:~$ nvidia-docker run --rm compute.nvidia.com/nvidia/cuda nvidia-smi
Wed Aug 31 01:51:45 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.77                 Driver Version: 361.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 0000:06:00.0     Off |                    0 |
| N/A   34C    P0    30W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 0000:07:00.0     Off |                    0 |
| N/A   36C    P0    32W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  Off  | 0000:0A:00.0     Off |                    0 |
| N/A   36C    P0    33W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  Off  | 0000:0B:00.0     Off |                    0 |
| N/A   37C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-SXM2...  Off  | 0000:85:00.0     Off |                    0 |
| N/A   38C    P0    30W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-SXM2...  Off  | 0000:86:00.0     Off |                    0 |
| N/A   34C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-SXM2...  Off  | 0000:89:00.0     Off |                    0 |
| N/A   36C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-SXM2...  Off  | 0000:8A:00.0     Off |                    0 |
| N/A   37C    P0    31W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
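
The table above shows all eight GPUs idle. On a busy day it may be easier to pull out the idle ones programmatically. The following is a minimal sketch, assuming a GPU counts as idle when its used memory is below a threshold (100 MiB by default). The helper name idle_gpus is hypothetical. It reads the CSV produced by nvidia-smi's --query-gpu option from standard input, so it can also be tried on sample text.

```shell
#!/usr/bin/env bash
# Sketch: list GPU indices whose used memory is below a threshold.
# idle_gpus is a hypothetical helper. Input lines look like "0, 15" (index, MiB used).
idle_gpus() {
    # $1: threshold in MiB (defaults to 100)
    awk -F', *' -v max="${1:-100}" \
        '$2 + 0 < max { out = out (out == "" ? "" : ",") $1 } END { print out }'
}

# Only query the driver when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | idle_gpus 100
fi
```

For the output shown above this would print 0,1,2,3,4,5,6,7, since every GPU reports 0MiB in use.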

csid@visionlab-dgx1:~$ nvidia-docker run --rm -ti compute.nvidia.com/nvidia/caffe
root@90d82a0a1592:/workspace# caffe device_query -gpu=0
libdc1394 error: Failed to initialize libdc1394
I0831 01:53:44.966660    81 caffe.cpp:118] Querying GPUs 0
I0831 01:53:45.278440    81 common.cpp:193] Device id:                     0
I0831 01:53:45.278481    81 common.cpp:194] Major revision number:         6
I0831 01:53:45.278486    81 common.cpp:195] Minor revision number:         0
I0831 01:53:45.278491    81 common.cpp:196] Name:                          Tesla P100-SXM2-16GB
I0831 01:53:45.278494    81 common.cpp:197] Total global memory:           17071669248
I0831 01:53:45.278501    81 common.cpp:198] Total shared memory per block: 49152
I0831 01:53:45.278506    81 common.cpp:199] Total registers per block:     65536
I0831 01:53:45.278509    81 common.cpp:200] Warp size:                     32
I0831 01:53:45.278513    81 common.cpp:201] Maximum memory pitch:          2147483647
I0831 01:53:45.278520    81 common.cpp:202] Maximum threads per block:     1024
I0831 01:53:45.278524    81 common.cpp:203] Maximum dimension of block:    1024, 1024, 64
I0831 01:53:45.278528    81 common.cpp:206] Maximum dimension of grid:     2147483647, 65535, 65535
I0831 01:53:45.278532    81 common.cpp:209] Clock rate:                    405000
I0831 01:53:45.278537    81 common.cpp:210] Total constant memory:         65536
I0831 01:53:45.278542    81 common.cpp:211] Texture alignment:             512
I0831 01:53:45.278548    81 common.cpp:212] Concurrent copy and execution: Yes
I0831 01:53:45.278578    81 common.cpp:214] Number of multiprocessors:     56
I0831 01:53:45.278584    81 common.cpp:215] Kernel execution timeout:      No
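
As a quick sanity check, the "Total global memory" that Caffe reports in bytes can be compared with the per-GPU memory that nvidia-smi reports in MiB. The helper name bytes_to_mib below is hypothetical, and the conversion uses integer division, so any fractional MiB is dropped.

```shell
#!/usr/bin/env bash
# Sketch: convert a byte count to whole MiB using shell integer arithmetic.
# bytes_to_mib is a hypothetical helper used only for this comparison.
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Value taken from the caffe device_query output above.
bytes_to_mib 17071669248   # prints 16280
```

The result matches the 16280MiB shown per GPU in the nvidia-smi table, so both tools are describing the same 16GB P100.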