C[omp]ute

© 2006-2017 Andrew Cooke (site) / post authors (content).

Some Initial Results for Overlapping Tiles with CUDA

From: "andrew cooke" <andrew@...>

Date: Mon, 28 Jul 2008 20:36:24 -0400 (CLT)

I wrote the following code to simulate (perhaps not exactly) the memory
loads that would occur using CUDA if the data were processed using a
tiling that overlaps (so something like the Matrix example, but with
leaking across the boundaries of the box - perhaps for convolving with a
kernel, for example, or, as in my case, calculating "life").

Using the approach shown in the code (large integer types and a single
overlap) is inefficient because (at least for CUDA 1.0 and 1.1) the reads
cannot be coalesced - either the half-warp is the wrong width, or the
shift (which is less than the half-warp to allow for overlapping) is
wrong.
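
A minimal sketch of the coalescing rule being modelled, for a single
half-warp reading consecutive words. The function name and parameters are
hypothetical, and this is the same simplification the code below uses
(one load if the whole read sits in one segment), not the exact hardware
rule:

```c
#include <assert.h>

/* Simplified model, not the exact hardware rule: number of memory loads
 * for one half-warp that reads half_warp consecutive words of `word`
 * bytes, starting at byte offset `start`.  `strict` selects the 1.0 / 1.1
 * behaviour; otherwise the 1.2 behaviour is used. */
static int half_warp_loads(long start, int word, int half_warp,
                           int segment, int strict)
{
    assert(start >= 0 && word > 0 && half_warp > 0 && segment > 0);

    /* first and last segment touched by the read */
    long seg_first = start / segment;
    long seg_last = (start + (long)half_warp * word - 1) / segment;

    if (strict) {
        /* 1.0 and 1.1: a single load only if everything falls in one
         * segment, otherwise one load per thread */
        return (seg_first == seg_last) ? 1 : half_warp;
    }
    /* 1.2: one load per segment touched */
    return (int)(seg_last - seg_first + 1);
}
```

With 8 byte words, 16 threads and 128 byte segments, a read starting at
byte 0 costs 1 load, but the same read shifted by a single word (byte 8)
costs 16 loads under the strict rule and 2 under the 1.2 rule.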

I'm going to see now if using a smaller integer type and overlapping a
whole half-warp makes more sense (sounds crazy, but might work...).

Andrew


(this is just the core block, to give some idea of what's happening)

// run through each tile position
int count = 0;
for (int j = 0; j < nY; j++) {
    for (int i = 0; i < nX; i++) {
        // for each tile, run through the half-warps
        for (int k = 0; k < nHalfWarps; k++) {
            for (int l = 0; l < halfWarp; l++) {
                int localOffset = k * halfWarpWidth + l * word;
                int localX = localOffset % windowX;
                int localY = localOffset / windowX;
                int globalX = i * strideX + localX;
                int globalY = j * strideY + localY;
                int globalOffset = globalY * (*paddedX) + globalX;
                int segStart = globalOffset / segment;
                int segEnd =
                    (globalOffset + halfWarpWidth - word) / segment;

                if (prop.minor < 2) {
                    // 1.0 and 1.1 are really strict about what will
                    // be coalesced.
                    if (segStart == segEnd) {
                        count = count + 1;
                    } else {
                        count = count + halfWarp;
                    }
                } else {
                    // 1.2 is more lenient and simply groups as
                    // necessary
                    count = count + segEnd - segStart + 1;
                }
            }
        }
    }
}


And the output:

Loads for 1234,1234 using  184,  20 stepping  176,  19
8 bytes/word; 128 segments; 16 half-warp
Best count 3180010 for 184, 20 over 1240,1236

See how the window is 8 bytes wider than the step in X (184 v 176) because
I used 8 byte ints (even though I only need 1 bit of overlap).

The total number of threads per block would be 184*20/8 = 460.
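
The figures in the output can be checked with a little arithmetic. This is
only a sketch: the helper names are hypothetical, and the padding rule
(round up to a whole number of steps plus one window) is an assumption,
although it does reproduce the 1240 x 1236 reported above.

```c
#include <assert.h>

/* Number of tiles of `window` bytes, placed every `stride` bytes, needed
 * to cover `size` bytes (assumed rule: last tile may run past the data). */
static int tiles_needed(int size, int window, int stride)
{
    assert(stride > 0 && window >= stride);
    if (size <= window) {
        return 1;
    }
    /* one tile, plus enough extra steps to reach the end (rounded up) */
    return 1 + (size - window + stride - 1) / stride;
}

/* Size of the padded data area actually covered by those tiles. */
static int padded_size(int size, int window, int stride)
{
    return (tiles_needed(size, window, stride) - 1) * stride + window;
}

/* One thread per word in the tile. */
static int threads_per_block(int window_x, int window_y, int word)
{
    return window_x * window_y / word;
}
```

For the run above, padded_size(1234, 184, 176) gives 1240,
padded_size(1234, 20, 19) gives 1236, and threads_per_block(184, 20, 8)
gives 460.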

Better Code + Numbers

From: "andrew cooke" <andrew@...>

Date: Mon, 28 Jul 2008 21:45:23 -0400 (CLT)

There were a fair number of bugs in the code above.  Not sure I have it
right yet, but I seem to be getting numbers that make more sense.

So, the possible tactics are:

1 - Use a large integer and overlap only as little as possible.
2 - Use a small integer and overlap by a whole segment.
3 - Use a large integer and overlap by a whole segment.

For a "very large" (ie each dimension significantly larger than the
largest possible tile dimension) data area, searching only over the
largest tiles (ie given X, calculate Y from memory limitations etc) the
relative numbers of memory loads (smaller the better) are:

1 - 10
2 - 2
3 - 1

So it's clearly better to overlap by a whole segment, even though more
memory is "thrown away" (as expected).  The relative numbers for the two
integer sizes (2 v 1) just reflect the sizes themselves (4 v 8 bytes).
Since larger integers load more slowly this may not be significant.
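
One way to see why the whole-segment overlap helps is to count how many
tiles start on a segment boundary. The sketch below is an illustration
only: the function name is hypothetical, it looks at a single row, and it
ignores the row pitch (which also needs to be a multiple of the segment
for later rows to stay aligned).

```c
#include <assert.h>

/* Count the tiles whose first byte in a row falls on a segment boundary,
 * for tiles placed every `stride` bytes along that row. */
static int aligned_tiles(int n_tiles, int stride, int segment)
{
    int aligned = 0;
    assert(n_tiles >= 0 && stride > 0 && segment > 0);
    for (int i = 0; i < n_tiles; i++) {
        if ((i * stride) % segment == 0) {
            aligned++;
        }
    }
    return aligned;
}
```

With the 176 byte step from the earlier output and 128 byte segments,
aligned_tiles(7, 176, 128) finds only the first tile aligned.  With a step
that is a multiple of the segment (which is what overlapping by a whole
segment gives, assuming the window is itself a multiple of the segment),
aligned_tiles(7, 128, 128) finds all 7.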

For the original size I was using (1234 x 1234 bytes) things are less
clear because the size of the tile approaches the size of the data in some
configurations, so tweaking tile shapes becomes significant (in fact [1]
won out because a tile could cover all the data, but [3] was still close).

Andrew
