
C[omp]ute


© 2006-2017 Andrew Cooke (site) / post authors (content).

Some Initial Results for Overlapping Tiles with CUDA

From: "andrew cooke" <andrew@...>

Date: Mon, 28 Jul 2008 20:36:24 -0400 (CLT)

I wrote the following code to simulate (perhaps not exactly) the memory
loads that would occur using CUDA if the data were processed using a
tiling that overlaps (so something like the Matrix example, but with
leaking across the boundaries of the box - perhaps for convolving with a
kernel, for example, or, as in my case, calculating "life").

Using the approach shown in the code (large integer types and a single
overlap) is inefficient because (at least for CUDA compute capability 1.0
and 1.1) the reads cannot be coalesced - either the half-warp is the wrong
width, or the shift (which is less than the half-warp to allow for
overlapping) is wrong.
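
To make the alignment problem concrete, here is a minimal sketch in C. It
assumes the usual 1.0/1.1 rule that a half-warp's read coalesces only if it
starts on a segment boundary, that the half-warp is exactly one segment wide
(16 threads x 8 bytes = 128 bytes), and that the data itself starts on a
segment boundary. The function name and parameters are hypothetical, chosen
for illustration.

```c
/* Sketch: returns 1 if every half-warp read in every tile starts on a
 * segment boundary, 0 otherwise.  Assumes the half-warp width equals the
 * segment size and that the data starts on a segment boundary.
 * All names here are hypothetical. */
int tiling_is_aligned(int padded_x, int window_x, int stride_x, int segment)
{
    /* each new row of the padded data must start on a segment boundary */
    if (padded_x % segment != 0) return 0;
    /* each tile must start on a segment boundary */
    if (stride_x % segment != 0) return 0;
    /* each tile row must be a whole number of segments, otherwise
     * half-warps straddle rows and start at unaligned positions */
    if (window_x % segment != 0) return 0;
    return 1;
}
```

With a padded width of 1240, a tile width of 184, a step of 176 and 128-byte
segments this returns 0; an illustrative tiling of 1280, 256 and 128 would
return 1.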

I'm going to see now if using a smaller integer type and overlapping a
whole half-warp makes more sense (sounds crazy, but might work...).

Andrew


(this is just the core block to give some idea of what's happening)

// run through each tile position
int count = 0;
for (int j = 0; j < nY; j++) {
    for (int i = 0; i < nX; i++) {
        // for each tile, run through the half-warps
        for (int k = 0; k < nHalfWarps; k++) {
            for (int l = 0; l < halfWarp; l++) {
                int localOffset = k * halfWarpWidth + l * word;
                int localX = localOffset % windowX;
                int localY = localOffset / windowX;
                int globalX = i * strideX + localX;
                int globalY = j * strideY + localY;
                int globalOffset = globalY * (*paddedX) + globalX;
                int segStart = globalOffset / segment;
                int segEnd =
                    (globalOffset + halfWarpWidth - word) / segment;

                if (prop.minor < 2) {
                    // 1.0 and 1.1 are really strict about what will
                    // be coalesced.
                    if (segStart == segEnd) {
                        count = count + 1;
                    } else {
                        count = count + halfWarp;
                    }
                } else {
                    // 1.2 is more lenient and simply groups as
                    // necessary
                    count = count + segEnd - segStart + 1;
                }
            }
        }
    }
}
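
The branch on prop.minor can be pulled out into a small helper. This is a
hedged sketch rather than the code that produced the numbers here: it
charges the cost once per half-warp, taking the offset of the half-warp's
first byte, and the function name and parameters are hypothetical.

```c
/* Sketch: memory transactions for one half-warp whose first byte is at
 * start_offset.  cc_minor plays the role of prop.minor.
 * The name and signature are hypothetical. */
int half_warp_loads(int start_offset, int word, int half_warp,
                    int segment, int cc_minor)
{
    /* offset of the first byte of the last word read by the half-warp */
    int last_offset = start_offset + (half_warp - 1) * word;
    int seg_start = start_offset / segment;
    int seg_end = last_offset / segment;

    if (cc_minor < 2) {
        /* 1.0 and 1.1: one transaction if everything sits in a single
         * segment, otherwise one transaction per thread */
        return (seg_start == seg_end) ? 1 : half_warp;
    }
    /* 1.2: one transaction per segment touched */
    return seg_end - seg_start + 1;
}
```

For 8-byte words, 16 threads and 128-byte segments, a half-warp starting at
offset 0 costs 1 load on either kind of device, while one starting at offset
8 costs 16 loads under the strict rule and 2 under the lenient one.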


And the output:

Loads for 1234,1234 using  184,  20 stepping  176,  19
8 bytes/word; 128 segments; 16 half-warp
Best count 3180010 for 184, 20 over 1240,1236

See how the stepping in X is 8 bytes less than the tile width (176 v 184)
because I used 8-byte ints (even though I only need a 1-bit overlap).

The total number of threads per block would be 184*20/8 = 460.
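
The padded size and the thread count can be checked with a little integer
arithmetic. The sketch below is an assumption about how the padding works
(tiles placed every stride until the data is covered), not the original
code, and the function names are hypothetical. It does reproduce the
1240 x 1236 padded size and the 460 threads for the numbers above.

```c
/* Sketch: number of tiles of size `window`, placed every `stride`,
 * needed to cover `data` bytes (or rows).  Hypothetical helper. */
int tiles_needed(int data, int window, int stride)
{
    if (data <= window) return 1;
    /* round up the number of extra steps after the first tile */
    return 1 + (data - window + stride - 1) / stride;
}

/* Extent covered by those tiles; the data is padded up to this size. */
int padded_size(int data, int window, int stride)
{
    return (tiles_needed(data, window, stride) - 1) * stride + window;
}

/* One thread per word of the tile. */
int threads_per_block(int window_x, int window_y, int word)
{
    return window_x * window_y / word;
}
```

In X, 1234 bytes with a 184-byte tile stepping by 176 needs 7 tiles and pads
to 1240; in Y, 1234 rows with a 20-row tile stepping by 19 needs 65 tiles
and pads to 1236.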

Better Code + Numbers

From: "andrew cooke" <andrew@...>

Date: Mon, 28 Jul 2008 21:45:23 -0400 (CLT)

There were a fair number of bugs in the code above.  Not sure I have it
right yet, but I seem to be getting numbers that make more sense.

So, the possible tactics are:

1 - Use a large integer and overlap as little as possible
2 - Use a small integer and overlap by a whole segment
3 - Use a large integer and overlap by a whole segment

For a "very large" (ie each dimension significantly larger than the
largest possible tile dimension) data area, searching only over the
largest tiles (ie given X, calculate Y from memory limitations etc) the
relative numbers of memory loads (smaller the better) are:

1 - 10
2 - 2
3 - 1

So it's clearly better to overlap by a whole segment, even though more
memory is "thrown away" (as expected).  The relative numbers for the two
integer sizes (tactics 2 and 3) just reflect the sizes themselves (4 v 8
bytes).  Since larger integers load more slowly this may not be
significant.
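
How much memory is "thrown away" by a given overlap is simple arithmetic.
A minimal sketch, with hypothetical function names; the 256/128 tile
mentioned afterwards is an illustrative assumption, not a size taken from
these experiments.

```c
/* Sketch: bytes in each tile row that the neighbouring tile also loads.
 * Hypothetical helper. */
int redundant_bytes_per_row(int window_x, int stride_x)
{
    return window_x - stride_x;
}

/* The same quantity in parts per thousand of the tile row
 * (integer division, so the result is rounded down). */
int redundant_permille(int window_x, int stride_x)
{
    return 1000 * (window_x - stride_x) / window_x;
}
```

For a 184-byte tile stepping by 176 this gives 8 bytes, or about 4% of each
row; a 256-byte tile overlapping by a whole 128-byte segment would reload
50% of each row.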

For the original size I was using (1234 x 1234 bytes) things are less
clear because the size of the tile approaches the size of the data in some
configurations, so tweaking tile shapes becomes significant (in fact [1]
won out because a tile could cover all the data, but [3] was still close).

Andrew
