CUDA 4.0 MultiGPU on an Amazon EC2 instance


This post will take you through starting and configuring an Amazon EC2 instance to use the Multi GPU features of CUDA 4.0.


CUDA 4.0 comes with some exciting new features, such as:

  • the ability to share GPUs across multiple host threads;
  • the ability to use all GPUs in the system concurrently from a single host thread;
  • unified virtual addressing (UVA) for simpler multi-GPU programming;

and many more.

The ability to access all the GPUs in a system from a single thread is particularly nice on Amazon, since the large GPU-enabled instances come with two Tesla M2050 Fermi boards, each with 448 cores, 3 GB of memory, and a theoretical peak of 1030 GFLOPS.
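As a sanity check on that figure: single-precision peak for a Fermi part is cores × shader clock × 2 FLOPs per cycle (one fused multiply-add). A quick sketch, assuming the M2050's 1.15 GHz shader clock (not stated above):

```python
# Theoretical single-precision peak of one Tesla M2050.
# Assumption: 1.15 GHz shader clock (not given in the post).
cores = 448
clock_hz = 1.15e9
flops_per_cycle = 2  # one fused multiply-add per core per cycle

peak_gflops = cores * clock_hz * flops_per_cycle / 1e9
print(f"Peak: {peak_gflops:.1f} GFLOPS")  # ~1030 GFLOPS, matching the figure above
```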

Getting started

Signing up to Amazon’s AWS is easy enough with a Credit Card, and once you are logged in, go to the EC2 tab of your console which should look something like this:

The EC2 console page
The EC2 console page.

Now press the Launch Instance button. In the Community AMIs tab, set the Viewing option to Amazon Images, search for gpu, select the CentOS 5.5 GPU HVM AMI and press Continue:

Choose an AMI
Choose the CentOS 5.5 GPU HVM AMI (bottom one).

Next we need to select the Instance Type, and it's important here to select the Cluster GPU type; then press Continue:

Instance type
Select the Cluster GPU Instance Type.

Next we need to Create a New Key Pair: give it a name like amazon-gpu and press Create & Download your Key Pair to save it to your local computer as a file called amazon-gpu.pem:

Create Key Pair
Create and download Key Pair.

We press Continue to go to the Firewall settings. Here we Create a new Security Group, give it a name and description, then Create a new rule for ssh so that we can log into our instance once it's up and running, and press Continue:

Security Group
Create a new Security Group and a new ssh rule.

And finally we can review our settings and Launch it:

Review and Launch
Review and Launch instance.

Back in our EC2 console we can go to Instances and see our new instance's Status. It should be booting or running, rather than stopped as in the screenshot below:

AMI Instance
AMI Instance's Status and Description.

The Description tab will also contain the Public DNS which we can use together with the Key Pair we downloaded locally to ssh into our instance:

$ chmod 400 amazon-gpu.pem
$ ssh -i amazon-gpu.pem root@<Public DNS>

   __|  __|_  )   CentOS
   _|  (     /    v5.5
  ___|\___|___|   HVMx64 GPU

Welcome to an EC2 Public Image
Please view /root/README


[root@ip-10-16-7-119 ~]#

Updating CUDA to 4.0

Now we need to update the CUDA driver and toolkit on our instance, so the first thing we do is to update the Linux Kernel and reboot the instance via the web console:

[root@ip-10-16-7-119 ~]# yum update kernel kernel-devel kernel-headers
Loaded plugins: fastestmirror
Determining fastest mirrors
* addons:
* base:
* extras:
* updates:
addons | 951 B 00:00
base | 2.1 kB 00:00
base/primary_db | 2.2 MB 00:00
extras | 2.1 kB 00:00
extras/primary_db | 260 kB 00:00
updates | 1.9 kB 00:00
updates/primary_db | 635 kB 00:00
Setting up Update Process
Resolving Dependencies
--> Running transaction check
---> Package kernel.x86_64 0:2.6.18-238.12.1.el5 set to be installed
---> Package kernel-devel.x86_64 0:2.6.18-238.12.1.el5 set to be installed
---> Package kernel-headers.x86_64 0:2.6.18-238.12.1.el5 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

 Package          Arch     Version               Repository   Size
 kernel           x86_64   2.6.18-238.12.1.el5   updates      19 M
 kernel-devel     x86_64   2.6.18-238.12.1.el5   updates      5.5 M
 kernel-headers   x86_64   2.6.18-238.12.1.el5   updates      1.2 M

Transaction Summary
Install 2 Package(s)
Upgrade 1 Package(s)

Total download size: 26 M
Is this ok [y/N]: y
Downloading Packages:
(1/3): kernel-headers-2.6.18-238.12.1.el5.x86_64.rpm | 1.2 MB 00:00
(2/3): kernel-devel-2.6.18-238.12.1.el5.x86_64.rpm | 5.5 MB 00:00
(3/3): kernel-2.6.18-238.12.1.el5.x86_64.rpm | 19 MB 00:00
Total 18 MB/s | 26 MB 00:01
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : kernel-devel 1/4
Installing : kernel 2/4
Updating : kernel-headers 3/4
Cleanup : kernel-headers 4/4

Installed:
  kernel.x86_64 0:2.6.18-238.12.1.el5    kernel-devel.x86_64 0:2.6.18-238.12.1.el5

Updated:
  kernel-headers.x86_64 0:2.6.18-238.12.1.el5


I leave it as an exercise to figure out how to reboot the instance from the console, but once it's back up and running, we can ssh back in to download and install the CUDA 4.0 driver, toolkit and SDK. For example:

[root@ip-10-16-7-119 ~]# wget <URL of the CUDA 4.0 toolkit installer>
--2011-06-23 04:47:05--
Connecting to <host>:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 212338897 (203M) [application/octet-stream]
Saving to: `<toolkit installer .run>'

100%[======================================>] 212,338,897 33.2M/s in 6.3s

2011-06-23 04:47:12 (32.0 MB/s) - `<toolkit installer .run>' saved [212338897/212338897]


[root@ip-10-16-7-119 ~]# chmod +x <toolkit installer .run>
[root@ip-10-16-7-119 ~]# ./<toolkit installer .run>

This will install the CUDA toolkit. Install the driver and SDK in the same way, and finally check that everything is working by typing:

[root@ip-10-16-7-119 ~]# nvidia-smi -a -q

==============NVSMI LOG==============

Timestamp : Thu Jun 23 04:46:42 2011

Driver Version : 270.41.19

Attached GPUs : 2

GPU 0:0:3
Product Name : Tesla M2050
Display Mode : Disabled
Persistence Mode : Disabled
Driver Model
GPU 0:0:4

MultiGPU example

Once CUDA 4.0 is installed and working, we can try out the MultiGPU example that comes with the SDK we installed earlier. First we need to install the C++ compiler:

[root@ip-10-16-7-119 simpleMultiGPU]# yum install gcc-c++

and then we need to set our LD_LIBRARY_PATH to include the CUDA libraries:

[root@ip-10-16-7-119 release]# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/lib

After that, we can go to the NVIDIA_GPU_Computing_SDK/C/ folder and type make. The binaries will be installed in the NVIDIA_GPU_Computing_SDK/C/bin/linux/release/ directory and if we go there, we can run the simpleMultiGPU example:

[root@ip-10-16-7-119 release]# ./simpleMultiGPU
[simpleMultiGPU] starting...
CUDA-capable device count: 2
Generating input data...

Computing with 2 GPU's...
GPU Processing time: 24.472000 (ms)

Computing with Host CPU...

Comparing GPU and Host CPU results...
GPU sum: 16777280.000000
CPU sum: 16777294.395033
Relative difference: 8.580068E-07

[simpleMultiGPU] test results...

Press ENTER to exit...
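The relative difference the sample prints can be reproduced from the two sums in the output above; it is just |GPU − CPU| / CPU:

```python
# Reproduce simpleMultiGPU's reported relative difference
# from the GPU and CPU sums printed above.
gpu_sum = 16777280.000000
cpu_sum = 16777294.395033

rel_diff = abs(gpu_sum - cpu_sum) / cpu_sum
print(f"Relative difference: {rel_diff:.6E}")  # 8.580068E-07, as reported
```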

MultiGPU Cluster Setup

Using the above setup and this video, it is also possible to configure an 8-node cluster of GPU instances, as described here, for high-performance computing applications. I will try to do a MultiGPU and Open MPI example in another blog post, so stay tuned.

11 thoughts on “CUDA 4.0 MultiGPU on an Amazon EC2 instance”

  1. Great article. What does an 8-node cluster of GPU instances cost to run per hour on EC2, and how does a single GPU in this example compare, in core count, to, say, a single-chip Fermi board? Also, do you know of any CUDA cloud providers that provide Windows instances instead of CentOS?

  2. Thanks Robert. To answer your questions:

    1. So this single node cost me $2.10 per hour. I suspect for an 8 node GPU cluster the price will be more than $16.80 per hour not counting the price for storage and network access etc. Check here: Amazon EC2 Pricing.

    2. Not sure if I understand the question. The example above uses all 448*2 cores of the two GPUs on the instance, and it's the programmer's job to keep these cores "fed" by balancing the latency and computational requirements of the algorithm, but all of that is done in the code when we call the CUDA kernel. I suspect the Fermi chips on these boards are the same ones you would get on a Tesla or GeForce board, but I will need to check that... it might be that Nvidia bins the GPUs by the quality of working cores, with the lower-quality chips ending up on GeForce cards and the higher-quality ones in the Quadro or Tesla line... not sure though. Is that what you wanted to know?

    3. As far as I know, the Azure platform does not have GPU support yet. One could perhaps try to provision a Windows 64-bit HVM AMI (if such an AMI exists) on the Cluster GPU instances on Amazon, install the CUDA drivers etc., and then copy over your CUDA binary and run it... but I think that's not possible either. Perhaps you can also check out Hoopoe.

    Hope this answers your questions.
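    For the cost estimate in point 1, the arithmetic is simply the per-instance hourly rate times the node count (pricing as quoted above; storage and network transfer are billed separately):

    ```python
    # Lower bound on an 8-node Cluster GPU setup, using the
    # $2.10/hour rate quoted above (compute only; storage and
    # network transfer are extra).
    rate_per_hour = 2.10
    nodes = 8

    cluster_per_hour = rate_per_hour * nodes
    print(f"${cluster_per_hour:.2f} per hour")  # $16.80, the figure quoted above
    ```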

  3. Thanks Kashif, you answered all my questions. Nearly 1,000 cores per box, that’s really fantastic.

    Has anybody tried using PyCUDA in this scenario? If not, do you think it would work?

  4. Yes, pycuda (or pyopencl for that matter) should also work quite well; it's just a matter of compiling it on the instances.

  5. I couldn't sleep, so I logged in to Amazon and:

    ==============NVSMI LOG==============

    Timestamp : Tue Nov 15 17:45:08 2011

    Driver Version : 270.41.19

    Attached GPUs : 2

    GPU 0:0:3
    Product Name : Tesla M2050
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
    Current : N/A
    Pending : N/A
    Serial Number : 0322510018511


    Thanks Kashif, awesome article!!

  6. I even got the examples built and running, although there was no simpleMultiGPU example in the SDK I built, but another similar program proved multiple GPUs worked. Anyway, I'm off to bed, enough spamming your blog. Thanks and goodnight.

  7. I like the helpful information you provide in your articles.
    I will bookmark your blog and check again here frequently.

    I am quite certain I will learn plenty of new stuff right here!
    Best of luck for the next!
