Kirk & Hwu: Programming Massively Parallel Processors (Morgan Kaufmann) just arrived! My only concern so far is that I'm a quarter of the way through and it's still explaining cudaMalloc() and cudaMemcpy(). Nothing wrong with a solid foundation, I suppose. The highlight thus far: their main recurring example is a matrix-matrix multiplication function.

