## Another GPU Success - PyCUDA, Cross-Correlations

From: andrew cooke <andrew@...>

Date: Thu, 6 Dec 2012 19:58:38 -0300

I've spent the last week investigating using GPUs to calculate
cross-correlaions - something that's important for a couple of clients (it's a
very common operation in seismology because it tells you the relative delay
between two seismic signals recorded in different locations).

The fastest way to calculate correlation coefficients (that I know of, at
least), is to work with FFTs, because you can correlate two signals by
multiplying their Fourier transforms.

So I started with some simple Python numpy code that does exactly that for
synthetic data - it takes about 12s to cross-correlate all the different pairs
when you have 100 spectra, each with 10,000 bins.

Replacing that with code on the GPU, using PyCUDA and PyFFT, reduces the time
to 1.5s.  Taking a little more care, with streams, so that some data is being
that to 1.2s.  And using FFTs tuned for real (ie zero imaginary part) data
(scikits wrapper for CUBLAS) gets the time down to 0.6s.  So the final speedup
is a healthy factor of 20.

Obviously exact timings depend on the amount of data and the hardware used -
but the above is arguably fair, in that both the CPU and GPU are "once good,

And that's just for simple correlations.  I could imagine adding extra
processing to (1) reduce time further and (2) reduce the amount of data that
needs to be transferred back to the CPU.  For example: detecting and returning
the maximum value.

PyCUDA was great.  It took some learning, so initial development was slower,
but once I had things running, it was so much easier to play around, swapping
different options, than with C.  If necessary, I could now translate the
system to C, but my hope is that it will work as a standalone "correlation
service", if we can find a suitable interface (eg reading / writing files,
perhaps to an SSD).

One odd wrinkle, that perhaps is me simply not understanding something, is
that the scikits interface seems to be defined in terms of cudaarray
instances, which seem to me to be sugar in PyCUDA - the "fundamental"
interface being an opaque pointer to memory.  This complicates things when you
want low-level control in other parts of the code.

Andrew