Mathematica plug-in for CUDA 2

Posted by admin on June 26, 2008

Since there is a Matlab plug-in for CUDA that provides some examples of off-loading computation to the GPU, I thought it might be neat to have something similar for Mathematica. So as a start, I decided to try out a simple scalar product example using MathLink.

The initial template of my function is in the scalarProd.tm file:

double scalarProd P(( double*, long, double*, long));

:Begin:
:Function:      scalarProd
:Pattern:       ScalarProd[ x:{___Real}, y:{___Real} ]
:Arguments:     { x, y }
:ArgumentTypes: { RealList,  RealList}
:ReturnType:    Real
:End:

:Evaluate: ScalarProd::usage = "ScalarProd[x, y] gives the scalar product of two real lists {x1,x2,...,xn} and {y1,y2,...yn}."

which describes the ScalarProd[] function in Mathematica, and links it to the scalarProd() C method, which is where we receive the two arrays from Mathematica and use CUDA to calculate their scalar product and send the result back. This and the main() function for Linux and Mac, which is what I was using, are in the scalarProd.cu file. Note that Windows has a slightly different main() method.

#include <stdio.h>
#include <stdlib.h>
#include "mathlink.h"
#include "cuda_runtime.h"
int main(int argc, char* argv[])
{
  int deviceCount;
  cudaGetDeviceCount(&deviceCount); 
  if (deviceCount == 0) {                                                  
    fprintf(stderr, "cutil error: no devices supporting CUDA.n");
    exit(EXIT_FAILURE);         
  }
  int dev = 0; 
  cudaDeviceProp deviceProp;
  cudaGetDeviceProperties(&deviceProp, dev);                      
  if (deviceProp.major < 1) {                                             
    fprintf(stderr, "cutil error: device does not support CUDA.n");   
    exit(EXIT_FAILURE);                                                 
  }
  fprintf(stderr, "Using device %d: %sn", dev, deviceProp.name); 
  cudaSetDevice(dev);
 
  return MLMain(argc, argv);
}

and in the same scalarProd.cu we now include the scalarProd_kernel.cu kernel from CUDA’s SDK together with our scalarProd() C function:

#include "scalarProd_kernel.cu"
double scalarProd( double* x, long lenx, double *y, long leny)
{
  float *d_A, *d_B, *d_C;
  float *h_A, *h_B;
  float *h_C_GPU;
  int i;
 
  printf("Initializing data...n");
  h_A     = (float *)malloc(sizeof(float)*lenx);
  h_B     = (float *)malloc(sizeof(float)*leny);
 
  //convert double to float
  for(i = 0; i < lenx; i++){
    h_A[i] = (float)x[i];
    h_B[i] = (float)y[i];
  }
  h_C_GPU = (float *)malloc(sizeof(float));
 
  printf("...allocating GPU memory.n");
  cudaMalloc((void **)&d_A, sizeof(float)*lenx);
  cudaMalloc((void **)&d_B, sizeof(float)*leny);
  cudaMalloc((void **)&d_C, sizeof(float));
  printf("...copying input data to GPU mem.n");
  cudaMemcpy(d_A, h_A, sizeof(float)*lenx, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, sizeof(float)*leny, cudaMemcpyHostToDevice);
  printf("Data init done.n");
 
  printf("Executing GPU kernel...n");
  cudaThreadSynchronize();
  scalarProdGPU<<<128, 256>>>(d_C, d_A, d_B, 1, lenx);
  cudaThreadSynchronize();
 
  printf("Reading back GPU result...n");
  cudaMemcpy(h_C_GPU, d_C, sizeof(float), cudaMemcpyDeviceToHost);
 
  cudaFree(d_C);
  cudaFree(d_B);
  cudaFree(d_A);
  free(h_B);
  free(h_A);
 
  return h_C_GPU[0];
}

Now we are ready to run Mathematica’s mprep pre-processor from MathLink to generate a scalarProdtm.cu file, and on this we run CUDA’s compiler nvcc and compile everything with the appropriate CUDA and MathLink libraries to generate our scalarProd binary, which we can now call from within Mathematica:

In[1]:= Install["scalarProd"]
Out[1]= LinkObject[./scalarProd,8,7]

In[2]:= ScalarProd[{2.1,3.2},{4.3,5.4}]
Out[2]= 26.31

In[3]:= UnInstall[%1]
Out[3]= UnInstall[LinkObject[./scalarProd,8,7]]