Andrew Cooke | Contents | Latest | RSS | Twitter | Previous | Next


Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Lepl parser for Python.

Colorless Green.

Photography around Santiago.

SVG experiment.

Professional Portfolio

Calibration of seismometers.

Data access via web services.

Cache rewrite.

Extending OpenSSH.

C-ORM: docs, API.

Last 100 entries

[Link, Computing] Useful gcc flags; [Link] Voynich Manuscript Decoded; [Bike] Notes on Servicing Suspension Forks; [Links, Computing] Snap, Flatpack, Appimage; [Link, Computing] Oracle is leaving Java (to die); [Link, Politics] Cubans + Ultrasonics; [Book, Link] Laurent Binet; VirtualBox; [Book, Link] No One's Ways; [Link] The Biggest Problem For Cyclists Is Bad Driving; [Computing] Doxygen, Sphinx, Breathe; [Admin] Brokw Recent Permalinks; [Bike, Chile] Buying Bearings in Santiago; [Computing, Opensuse] Upgrading to 42.3; [Link, Physics] First Support for a Physics Theory of Life; [Link, Bike] Peruvian Frame Maker; [Link] Awesome Game Theory Tit-For-Tat Thing; [Food, Review] La Fabbrica - Good Italian Food In Santiago; [Link, Programming] MySQL UTF8 Broken; [Link, Books] Latin American Authors; [Link, Computing] Optimizatin Puzzle; [Link, Books, Politics] Orwell Prize; [Link] What the Hell Is Happening With Qatar?; [Link] Deep Learning + Virtual Tensor Machines; [Link] Scaled Composites: Largest Wingspan Ever; [Link] SCP Foundation; [Bike] Lessons From 2 Leading 2 Trailing; [Link] Veg Restaurants in Santiago; [Link] List of Contemporary Latin American Authors; [Bike] FTHR; [Link] Whoa - NSA Reduces Collection (of US Residents); [Link] Red Bull's Breitbart; [Link] Linux Threads; [Link] Punycode; [Link] Bull / Girl Statues on Wall Street; [Link] Beautiful Chair Video; Update: Lower Pressures; [Link] Neat Python Exceptions; [Link] Fix for Windows 10 to Avoid Ads; [Link] Attacks on ZRTP; [Link] UK Jazz Invasion; [Review] Cuba; [Link] Aricle on Gender Reversal of US Presidential Debate; {OpenSuse] Fix for Network Offline in Updater Applet; [Link] Parkinson's Related to Gut Flora; Farellones Bike Park; [Meta] Tags; Update: Second Ride; Schwalbe Thunder Burt 2.1 v Continental X-King 2.4; Mountain Biking in Santiago; Books on Ethics; Security Fail from Command Driven Interface; Everything Old is New Again; Interesting Take on Trump's Lies; Chutney v6; References on Entropy; Amusing "Alexa.." broadcast; The Shame of Chile's Education System; Playing mp4 gifs in Firefox on Opensuses Leap 42.2; Concurrency at Microsoft; Globalisation: Uk -> Chile; OpenSuse 42.2 and Synaptics Touch-Pads; Even; Cherry Jam; Lebanese Writer Amin Maalouf; C++ - it's the language of the future; Learning From Trump; Chinese Writer Hu Fayun; And; Apricot Jam; Also; Excellent Article on USA Politics; Oh Metafilter; Prejudice Against The Rurals; Also, Zizek; Trump; Why Trump Won; Doxygen + Latex on CentOS 6; SMASH - Solve 5 Biggest Problems in Physics; Good article on racism, brexit, and social divides; Grandaddy are back!; Consciousness From Max Entropy; Democrats; Harvard Will Fix Black Poverty; Modelling Bicycle Wheels; Amusing Polling Outlier; If Labour keeps telling working class people...; Populism and Choice; Books on Defeat; Enrique Ferrari - Argentine Author; Transcript of German Scientists on Learning of Hiroshima; Calvert Journal; Owen Jones on Twitter; Possible Japanese Authors; Complex American Literature; Chutney v5; Weird Componentized Virus; Interesting Argentinian Author - Antonio Di Benedetto; Useful Thread on MetaPhysics; RAND on fighting online anarchy (2001); NSA Hacked

© 2006-2017 Andrew Cooke (site) / post authors (content).

Experience Optimising Matlab Code with OpenCL (NVidia GPGPU)

From: andrew cooke <andrew@...>

Date: Thu, 17 Dec 2009 18:07:21 -0300

I just got some solid results on work I have been doing, here and
there, to optimize calculations originally written in Matlab.  While I
can't give full details, I hope I can give enough to be useful to
others in a similar position.

To motivate the rest of this, I'll say up front that it has been a big
success (IMHO).  The "core calculation" is now 77x faster than the
original Matlab code, using only a relatively cheap graphics card
(NVidia 9800 GT).  We bought that card for proof of concept but since
our processing time is now dominated by other factors it's not clear
we need anything better.

One of the details I cannot describe is the actual calculation, except
to say that at first sight it looks like it would be "trivially
parallelizable" because it's a bunch of nested loops that read from a
huge data structure.  However, on second inspection the inner loop has
some indirection that means it's unlikely that reads will coalesce (in
simple terms when adjacent cores in the GPU read adjacent addresses in
memory it can be read in an efficient stream; this is unlikely for

I did the optimisation in three stages.

First, I tried vectorizing the code within Matlab.  My understanding
was that while explicit sequences on Matlab commands run in a single
thread, individual commands may spawn work to other cores (and we have
a nice multi core Xeon box).  This did not work.  The issue may be
related to the exact structure of our problem (the indirection
described above), or perhaps we need to enable something with Matlab?
Whatever the cause, it was a big disappointment - seems to me Matlab
should be smarter than that, given the current rise of parallelism.

Second, I rewrote the inner loop in C.  No amazingly cool
optimisations, just the usual care over data layout etc.  This gave a
speedup of a factor of 6.  At this point it's probably worth
mentioning that I did most of the development work with Octave - - to avoid inconveniences related
to Matlab licensing, geography, etc.  To get full compatibility I
needed to build the latest version from their VCS, but otherwise it
worked out nicely.  Even the C interface is the same (Octave support
the Matlab mex file conventions).  One warning though - don't expect
to get useful information on timings: you need to measure performance
changes on Matlab.

Third, I then moved that to OpenCL on the GPU.  This gave a speedup of
77 (13 relative to C).  I don't have more details yet, but I assume we
have so many threads that the latency of the incoherent reads is
amortized quite nicely.  Again I did most of the development on a
different machine (with an 8400 GPU).  And again this worked nicely -
most of the effort was simply getting it to work at all (ie not
segfault); the numbers above are for code that has not been profiled
or optimised in any way....

What more can I say?  Various random details:

- OpenCL (with NVidia drivers) has been completely solid.  The
documentation is good too.  It's a bit of a pain to debug (all the joy
of C without print statements) and it took me some time to understand
the exact relationship between OpenCL and CUDA concepts, but the gain
has clearly outweighed the pain.

- I spent much more time worrying about memory accesses than I did
about anything else.  As I said above, I still haven't profiled this,
so I may have been prematurely optimising (or at least worrying
without cause, since the results of my worrying didn't significantly
change the code), but it seems to be a general rule that GPU
processing is limited by memory reads (and that this will get worse as
the number of cores increases).

- The point above is explained quite nicely in
- basically, unlike a CPU, a GPU has no cache.  When it is stalled
waiting for memory it simply switches to another thread.  This is why
it helps to have big loops.

- If you don't specify the work group size in the OpenCL EnqueueND...
call, then it appears to be set to 1!  This is absolutely crazy, if I
understand correctly, because it means that only one thread in a weft
is running (basically you want this number to be at least 32).  Having
said that, the 9800 appeared to handle this much better than then
8400, so perhaps more recent hardware is more flexible?

- Calculating a work group size automatically is non-trivial, because
it has to be a divisor of the total number of work items.  As far as I
can see there's no better way that factorising the total number and
then trying each different combination of factors to see which gets
closest to (but does not exceed) 512.  In practice, it's best to
expose this in your API (as a fallback I also have an "auto" option
that starts with the total number and throws away small factors until
the number is low enough - not brilliant, but simple) (there may be a
greedy algorithm to do this, but proof of concept code in C isn't the
time or place....) (and it's not Euclid's GCD either!)

- If you have large datasets, you need to worry about memory limits.
Current technology is 32bits, which means there's a limit of 4GB.  And
frustratingly it's in 4 separate banks (as far as I can tell you
cannot allocate more than 1/4 of the total memory on any NVidia GPU in
a single block) - you can explicitly jump between blocks in the kernel
(ie check offset and select array accordingly), but it makes life
messy.  Thankfully the next generation (Fermi, out early 2010 I
believe) has 64 bit addresses which should make this all go away (it
also has some kind of L2-like cache, which is interesting...).

- I hope to profile what I have and perhaps implement a better kernel
(I think we could exploit local memory...), but the initial results
were so good that there's little incentive to spend more time
improving what we have!

- When you're interfacing OpenCL / C with Matlab it helps to pass a
pointer back into Matlab (so that you can have persistent state across
calls).  The standard hack for doing this really does seem to be to
use a singleton matrix with a 64bit int on the Matlab side.  I found
some discussion in the Matlab support forums, and it worked for me.

- OpenCL (at least on the early/cheap hardware) is 32bit, while Matlab
uses doubles and longs.  So you need to convert your Matlab code with
single() and int32().  You also need to worry about lack of precision,

- See also


nice work

From: "John Melonakos" <john.melonakos@...>

Date: Thu, 17 Dec 2009 20:36:12 -0500

Hi Andrew,


Saw your post:


Looks like you've got some cool things cooking.  I'm with AccelerEyes
building Jacket.  There might be some possibilities for working together.
If you have some free time after the Holidays, it'd be nice to chat with you
by phone.






John Melonakos, PhD



Atlanta, GA USA

T: (770) 315-1099

Twitter:  @accelereyes <> 

W: <> 


The information transmitted in this email is intended only for the person or
entity to which it is addressed and contains AccelerEyes LLC confidential
and proprietary information. It may also refer to such AccelerEyes LLC
confidential or proprietary information that is subject to the terms of a
confidentiality agreement.

Further Optimisation with OpenCL

From: andrew cooke <andrew@...>

Date: Fri, 1 Jan 2010 09:53:24 -0300

I had time to look at optimising the OpenCL code yesterday.  I tried
two approaches, and both gave significant gains, but both also have

First, I tried changing an array use to store intermediate data from
global to local.  This means that instead of storing the data in the
main memory for the GPU, it is stored in an on-chip cache (and so is
much faster to access).  The amount of space is limited, so I also had
to restructure the program to calculate the absolute minimum amount of
data (which meant swapping the order of two loops nested inside each

This worked - gave roughly a factor of 1.5 speedup on a simple test -
but the amount of memory required was too much for most use cases (the
cache is just 16kB, and even on Fermi will be only 48kB).

Second, I tried changing an array to a 2D image.  As I've described
elsewhere, the main difficulty we have is that the inner operation of
our loops contains a level of indirection - we need to read data from
an array, but not in a completely uniform, methodical manner.  This
means that access to that data is fairly slow.

An image is the abstraction used by OpenCL for texture maps.  I am
unsure exactly how NVidia implements these, but the end result is that
there is some kind of local cache that is used to improve access times
to arrays when they are used in a non-uniform way (an obvious example
is the kernel for direct convolution; originally I guess from the name
that they held texture images that are mapped to a 3D scene).

Obviously this is fairly risky - the image (in practice I actually
needed to use multiple images work around size limitation) is much
bigger than any cache memory, so whether a speedup is seen will depend
on the access pattern (you need bunched, repeated access).  Luckily,
this fits nicely with our problem, and gave a doubling of speed on our
worst performing dataset :o)

This work took a day, so the payoff is significant - twice the speed
for a day's work means that it is definitely worth investigating.

(I should admit here that I still haven't used NVidia's profiler, but
I am increasingly confident in my mental model of how the code/GPU are
working together.)


Comment on this post