CUDA 4.0 MultiGPU on an Amazon EC2 instance

Posted by kashif on June 23, 2011

This post will take you through starting and configuring an Amazon EC2 instance to use the Multi GPU features of CUDA 4.0.

Motivation

CUDA 4.0 comes with some exciting new features, such as:

  • the ability to share GPUs across multiple threads;
  • the ability to use all GPUs in the system concurrently from a single host thread;
  • unified virtual addressing for faster multi-GPU programming;

and many more.

The ability to access all the GPUs in a system is particularly nice on Amazon, since the large GPU-enabled instances come with two Tesla M2050 Fermi boards, each with 448 cores, 3GB of memory and a theoretical peak performance of 1030 GFlops in single precision.
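
The 1030 GFlops figure can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the 1.15 GHz clock and the two floating-point operations per core per cycle (one fused multiply-add) are assumptions taken from NVIDIA's published M2050 specifications, not from this post.

```python
# Rough single-precision peak for a Tesla M2050.
# Assumptions (not stated in the post): 1.15 GHz shader clock, and one fused
# multiply-add per core per cycle, counted as 2 floating-point operations.
CORES_PER_GPU = 448
SHADER_CLOCK_GHZ = 1.15
FLOPS_PER_CORE_PER_CYCLE = 2
GPUS_PER_INSTANCE = 2


def peak_gflops(cores, clock_ghz, flops_per_cycle=FLOPS_PER_CORE_PER_CYCLE):
    """Theoretical peak in GFlops: cores x clock (GHz) x flops per cycle."""
    return cores * clock_ghz * flops_per_cycle


per_gpu = peak_gflops(CORES_PER_GPU, SHADER_CLOCK_GHZ)
per_instance = GPUS_PER_INSTANCE * per_gpu
print(f"per GPU: {per_gpu:.1f} GFlops, per instance: {per_instance:.1f} GFlops")
```

Under those assumptions each board comes out at about 1030 GFlops, so the two boards together offer roughly 2 TFlops of theoretical single-precision peak.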

Getting started

Signing up for Amazon’s AWS is easy enough with a credit card. Once you are logged in, go to the EC2 tab of your console, which should look something like this:

The EC2 console page.

Now press the Launch Instance button. In the Community AMIs tab, set the Viewing option to Amazon Images, search for gpu, select the CentOS 5.5 GPU HVM AMI and press Continue:

Choose the CentOS 5.5 GPU HVM AMI (bottom one).

Next we need to select the Instance Type. It's important to select the Cluster GPU type here, and then press Continue:

Select the Cluster GPU Instance Type.

Next we Create a New Key Pair by giving it a name like amazon-gpu, and press Create & Download your Key Pair to save it to our local computer as a file called amazon-gpu.pem:

Create and download Key Pair.

We press Continue to go to the Firewall settings. Here we Create a new Security Group, give it a name and description, and then Create a new rule for ssh so that we can log into our instance once it's up and running. Then we press Continue:

Create a new Security Group and a new ssh rule.

Finally, we can review our settings and Launch the instance:

Review and Launch instance.

Back in our EC2 console we can go to Instances and see our new instance's Status. It should be booting or running, rather than stopped as in the case below:

AMI Instance's Status and Description.

The Description tab will also contain the Public DNS which we can use together with the Key Pair we downloaded locally to ssh into our instance:

$ chmod 400 amazon-gpu.pem
$ ssh root@ec2-50-16-170-159.compute-1.amazonaws.com -i amazon-gpu.pem

       __|  __|_  )  CentOS
       _|  (     /   v5.5
      ___|\___|___|  HVMx64 GPU

Welcome to an EC2 Public Image
Please view /root/README
:-)

[root@ip-10-16-7-119 ~]#
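
The same login step can be scripted. A minimal sketch using only the Python standard library; the helper names `restrict_key` and `build_ssh_command` are hypothetical, and the host name is simply the example address shown above, so substitute your own instance's Public DNS.

```python
import os
import shlex
import stat


def restrict_key(key_path):
    """Equivalent of `chmod 400`: only the owner may read the key file."""
    os.chmod(key_path, stat.S_IRUSR)


def build_ssh_command(public_dns, key_path, user="root"):
    """Return the ssh argument list used above, without running it."""
    return ["ssh", f"{user}@{public_dns}", "-i", key_path]


if __name__ == "__main__":
    key = "amazon-gpu.pem"
    if os.path.exists(key):
        restrict_key(key)
    cmd = build_ssh_command("ec2-50-16-170-159.compute-1.amazonaws.com", key)
    # Print the command rather than executing it; pass `cmd` to
    # subprocess.run to actually connect.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The permissions are tightened first because ssh refuses to use a private key file that other users can read.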

Updating CUDA to 4.0

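Now we need to update the CUDA driver and toolkit on our instance. The NVIDIA driver installer builds a kernel module, so the first thing we do is update the Linux kernel together with the matching kernel-devel and kernel-headers packages, and then reboot the instance via the web console: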

[root@ip-10-16-7-119 ~]# yum update kernel kernel-devel kernel-headers
Loaded plugins: fastestmirror
Determining fastest mirrors
* addons: mirror.cogentco.com
* base: mirror.umoss.org
* extras: mirror.symnds.com
* updates: mirror.umoss.org
addons | 951 B 00:00
base | 2.1 kB 00:00
base/primary_db | 2.2 MB 00:00
extras | 2.1 kB 00:00
extras/primary_db | 260 kB 00:00
updates | 1.9 kB 00:00
updates/primary_db | 635 kB 00:00
Setting up Update Process
Resolving Dependencies
--> Running transaction check
---> Package kernel.x86_64 0:2.6.18-238.12.1.el5 set to be installed
---> Package kernel-devel.x86_64 0:2.6.18-238.12.1.el5 set to be installed
---> Package kernel-headers.x86_64 0:2.6.18-238.12.1.el5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
kernel x86_64 2.6.18-238.12.1.el5 updates 19 M
kernel-devel x86_64 2.6.18-238.12.1.el5 updates 5.5 M
Updating:
kernel-headers x86_64 2.6.18-238.12.1.el5 updates 1.2 M

Transaction Summary
================================================================================
Install 2 Package(s)
Upgrade 1 Package(s)

Total download size: 26 M
Is this ok [y/N]: y
Downloading Packages:
(1/3): kernel-headers-2.6.18-238.12.1.el5.x86_64.rpm | 1.2 MB 00:00
(2/3): kernel-devel-2.6.18-238.12.1.el5.x86_64.rpm | 5.5 MB 00:00
(3/3): kernel-2.6.18-238.12.1.el5.x86_64.rpm | 19 MB 00:00
--------------------------------------------------------------------------------
Total 18 MB/s | 26 MB 00:01
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : kernel-devel 1/4
Installing : kernel 2/4
Updating : kernel-headers 3/4
Cleanup : kernel-headers 4/4

Installed:
kernel.x86_64 0:2.6.18-238.12.1.el5 kernel-devel.x86_64 0:2.6.18-238.12.1.el5

Updated:
kernel-headers.x86_64 0:2.6.18-238.12.1.el5

 

Complete!

I leave it as an exercise to figure out how to reboot the instance from the console. Once it's back up and running, we can ssh back into it to download and install the CUDA 4.0 drivers, toolkit and SDK. For example:

[root@ip-10-16-7-119 ~]# wget http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/cudatoolkit_4.0.17_linux_64_rhel5.5.run
--2011-06-23 04:47:05-- http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/cudatoolkit_4.0.17_linux_64_rhel5.5.run
Resolving developer.download.nvidia.com... 168.143.242.144, 168.143.242.203
Connecting to developer.download.nvidia.com|168.143.242.144|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 212338897 (203M) [application/octet-stream]
Saving to: `cudatoolkit_4.0.17_linux_64_rhel5.5.run'

100%[======================================>] 212,338,897 33.2M/s in 6.3s

2011-06-23 04:47:12 (32.0 MB/s) - `cudatoolkit_4.0.17_linux_64_rhel5.5.run' saved [212338897/212338897]

 

[root@ip-10-16-7-119 ~]# chmod +x cudatoolkit_4.0.17_linux_64_rhel5.5.run
[root@ip-10-16-7-119 ~]# ./cudatoolkit_4.0.17_linux_64_rhel5.5.run

These commands install the CUDA toolkit. Install the driver and SDK in the same way, and finally check that everything is working by typing:

[root@ip-10-16-7-119 ~]# nvidia-smi -a -q

==============NVSMI LOG==============

Timestamp : Thu Jun 23 04:46:42 2011

Driver Version : 270.41.19

Attached GPUs : 2

GPU 0:0:3
Product Name : Tesla M2050
Display Mode : Disabled
Persistence Mode : Disabled
Driver Model
...
GPU 0:0:4
....
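
The same check can be automated by scanning the `nvidia-smi -q` report for the fields shown above. A minimal sketch, assuming the report keeps the `Attached GPUs : N` and `Product Name : ...` layout printed by driver 270.41.19; the function name `summarize_nvsmi` is hypothetical, and other driver versions may format the report differently.

```python
import re

# Excerpt of the report shown above. In practice it could be captured with
# subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout
SAMPLE_REPORT = """\
==============NVSMI LOG==============

Timestamp : Thu Jun 23 04:46:42 2011

Driver Version : 270.41.19

Attached GPUs : 2

GPU 0:0:3
Product Name : Tesla M2050
Display Mode : Disabled
Persistence Mode : Disabled
"""


def summarize_nvsmi(report):
    """Return (attached GPU count or None, list of product names found)."""
    count = re.search(r"^\s*Attached GPUs\s*:\s*(\d+)\s*$", report, re.MULTILINE)
    names = re.findall(r"^\s*Product Name\s*:\s*(.+?)\s*$", report, re.MULTILINE)
    return (int(count.group(1)) if count else None), names


gpu_count, product_names = summarize_nvsmi(SAMPLE_REPORT)
print(gpu_count, product_names)
```

On a correctly configured Cluster GPU instance the count should be 2. A missing `Attached GPUs` line (a count of `None` here) suggests the driver is not installed or not loaded.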

MultiGPU example

Once CUDA 4.0 is installed and working, we can try out the MultiGPU example that comes with the SDK installed earlier. First we need to install the C++ compiler:

[root@ip-10-16-7-119 simpleMultiGPU]# yum install gcc-c++

and then we need to set our LD_LIBRARY_PATH to include the CUDA libraries:

[root@ip-10-16-7-119 release]# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/lib
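
One subtlety with the export above: if `LD_LIBRARY_PATH` starts out empty, the result begins with a colon, and the dynamic loader treats an empty entry as the current directory. A small sketch of building the value without empty entries; `extend_library_path` is a hypothetical helper, and the two directories are the ones used above.

```python
import os

CUDA_LIB_DIRS = ["/usr/local/cuda/lib64", "/usr/local/cuda/lib"]


def extend_library_path(current, extra_dirs):
    """Append extra_dirs to a colon-separated path.

    Empty entries are dropped and directories already present are not
    added a second time.
    """
    entries = [entry for entry in current.split(":") if entry]
    for directory in extra_dirs:
        if directory not in entries:
            entries.append(directory)
    return ":".join(entries)


new_value = extend_library_path(os.environ.get("LD_LIBRARY_PATH", ""), CUDA_LIB_DIRS)
# Paste the printed line into the shell: changing os.environ here would only
# affect this Python process and its children, not the login shell.
print(f"export LD_LIBRARY_PATH={new_value}")
```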

After that, we can go to the NVIDIA_GPU_Computing_SDK/C/ folder and type make. The binaries are placed in the NVIDIA_GPU_Computing_SDK/C/bin/linux/release/ directory, and from there we can run the simpleMultiGPU example:

[root@ip-10-16-7-119 release]# ./simpleMultiGPU
[simpleMultiGPU] starting...
CUDA-capable device count: 2
Generating input data...

Computing with 2 GPU's...
GPU Processing time: 24.472000 (ms)

Computing with Host CPU...

Comparing GPU and Host CPU results...
GPU sum: 16777280.000000
CPU sum: 16777294.395033
Relative difference: 8.580068E-07

[simpleMultiGPU] test results...
PASSED

Press ENTER to exit...
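
The reported relative difference can be reproduced from the two sums in the output. A quick sketch, assuming the sample defines it as |CPU - GPU| / |CPU|; that definition is an assumption that happens to be consistent with the printed value, not something confirmed from the SDK source.

```python
# Sums copied from the simpleMultiGPU output above.
gpu_sum = 16777280.000000
cpu_sum = 16777294.395033


def relative_difference(reference, value):
    """Absolute difference scaled by the magnitude of the reference value."""
    return abs(reference - value) / abs(reference)


rel = relative_difference(cpu_sum, gpu_sum)
print(f"Relative difference: {rel:E}")
```

The result is about 8.58e-07, in line with the figure the program prints, so the two-GPU sum and the host sum differ by less than one part in a million.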

MultiGPU Cluster Setup

Using the above setup and this video, it is also possible to configure an 8-node cluster of GPU instances, as described here, for high performance computing applications. I will try to do a MultiGPU and Open MPI example in another blog post, so stay tuned.

Comments

  1. Robert Oschler Fri, 24 Jun 2011 12:40:55 UTC

    Great article. What does an 8-node cluster of GPU instances cost to run per hour on EC2, and how does a single GPU in this example compare in core count to, for example, a single-chip Fermi board? Also, do you know of any CUDA cloud providers that provide Windows instances instead of CentOS?

  2. kashif Fri, 24 Jun 2011 13:55:40 UTC

    Thanks Robert. To answer your questions:

    1. This single node cost me $2.10 per hour, so I suspect an 8-node GPU cluster will cost at least $16.80 per hour (8 × $2.10), not counting the price of storage, network access, etc. Check here: Amazon EC2 Pricing.

    2. I'm not sure I understand the question. The example above uses all 448*2 cores of the two GPUs on the instance, and it's the programmer’s job to keep these cores “fed” by balancing the latency and computational requirements of the algorithm, but all of that is done in the code when we call the CUDA kernel. I suspect the Fermi chips on these boards are the same ones you would get on a Tesla or GeForce board, but I will need to check that… it might be that Nvidia sorts the GPUs according to the quality of their working cores, so that the lower-quality chips end up on GeForce cards and the higher-quality ones in their Quadro or Tesla lines… not sure though. Is that what you wanted to know?

    3. As far as I know, the Azure platform does not have GPU support yet. One could perhaps try to provision a 64-bit Windows HVM AMI (if such an AMI exists) on the Cluster GPU instances on Amazon, install the CUDA drivers etc., and then copy over your CUDA binary and run it… but I think that's not possible either. Perhaps you can also check out Hoopoe.

    Hope this answers your questions.

  3. Robert Oschler Tue, 28 Jun 2011 02:59:27 UTC

    Thanks Kashif, you answered all my questions. Nearly 1,000 cores per box, that’s really fantastic.

    Has anybody tried using PyCUDA in this scenario? If not, do you think it would work?

  4. kashif Tue, 28 Jun 2011 22:09:55 UTC

    Yes, PyCUDA (or PyOpenCL for that matter) should work quite well too; it's just a matter of compiling it on the instances.

  5. Robert Oschler Wed, 29 Jun 2011 09:44:55 UTC

    Thanks kashif. I’ll be watching your blog to see if you do a PyCUDA post. :)

  6. Michael Sorhaindo Wed, 16 Nov 2011 02:51:37 UTC

    Thanks Kashif! I’m going to try this out tomorrow!

  7. Michael Sorhaindo Wed, 16 Nov 2011 03:47:08 UTC

    I couldn’t sleep so I logged in to amazon and

    ==============NVSMI LOG==============

    Timestamp : Tue Nov 15 17:45:08 2011

    Driver Version : 270.41.19

    Attached GPUs : 2

    GPU 0:0:3
    Product Name : Tesla M2050
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
    Current : N/A
    Pending : N/A
    Serial Number : 0322510018511

    …etc…

    Thanks Kashif, awesome article!!

  8. Michael Sorhaindo Wed, 16 Nov 2011 04:00:05 UTC

    I even got the examples built and running, although there was no simpleMultiGPU example in the SDK I built, but another similar program proved multiple CPU’s worked. Anyways I’m off to bed, enough spamming your blog. Thanks and goodnight.

  9. kashif Wed, 16 Nov 2011 04:19:25 UTC

    Nice! Glad you got it working. I’m working on a GPU Cluster tutorial. Will post it soon.
