This CUDA matrix multiplication code handles both square and non-square matrices; however, both the width and the height must be multiples of blocksize.
So, for example, I can multiply [3][6] * [6][3] (using blocksize = 3), but I cannot multiply [3][2] * [2][3].
Does anyone know how to do this? This is my kernel:
    #include <stdio.h>
    #include <limits.h>
    #include <stdlib.h>

    #define blocksize 3
    #define HM (1 * blocksize)
    #define WM (2 * blocksize)
    #define WN (1 * blocksize)
    #define HN WM
    #define WP WN
    #define HP HM
    #define PTH WM
    #define PTW HM

    __global__ void nonsquare(float *M, float *N, float *P, int uWM, int uWN)
    {
        __shared__ float MS[blocksize][blocksize];
        __shared__ float NS[blocksize][blocksize];

        int tx = threadIdx.x, ty = threadIdx.y, bx = blockIdx.x, by = blockIdx.y;
        int rowM = ty + by * blocksize;
        int colN = tx + bx * blocksize;
        float pv = 0;

        for (int m = 0; m < uWM / blocksize; ++m) {
            MS[ty][tx] = M[rowM * uWM + (m * blocksize + tx)];
            NS[ty][tx] = M[colN + uWN * (m * blocksize + ty)];
            __syncthreads();

            for (int k = 0; k ...

Thanks in advance!
I think the easiest thing to do is just to pad the blocks at the end with zeros:

    for (int m = 0; m < uWM / blocksize; ++m) {
        int colM = m * blocksize + tx;
        int rowN = m * blocksize + ty;
        if (rowM > uWN || rowN > uWM || colM > uWM || colN > uWN) {
            MS[ty][tx] = 0.;
            NS[ty][tx] = 0.;
        } else {
            MS[ty][tx] = M[rowM * uWM + colM];
            NS[ty][tx] = N[colN + uWN * rowN];
        }

plus or minus. (That NS line should reference N, not M, right?)
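Since the fragment above is only approximate, here is a fuller sketch of the same zero-padding idea that can be compiled and checked. It is one possible reading, not necessarily what the answer intended. It differs in four ways: the tile count is rounded up, so a partial last tile is still visited (with uWM = 2 and blocksize = 3, uWM / blocksize is 0 and the loop would never run); the two shared tiles are guarded separately; the bounds tests use < against the matrix size; and out-of-range threads skip the final store. The names nonsquarePadded, runPaddedExample, TILE, hM, wM and wN are hypothetical. In particular, hM (the height of M) is an extra parameter assumed here, because the original kernel only receives uWM and uWN. Error checking of the CUDA calls is left out for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define TILE 3  // tile edge; plays the role of blocksize in the question

// P (hM x wN) = M (hM x wM) * N (wM x wN), all row-major.
// hM is an assumed extra parameter; the kernel in the question only takes uWM and uWN.
__global__ void nonsquarePadded(const float *M, const float *N, float *P,
                                int hM, int wM, int wN)
{
    __shared__ float MS[TILE][TILE];
    __shared__ float NS[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int rowM = ty + blockIdx.y * TILE;
    int colN = tx + blockIdx.x * TILE;
    float acc = 0.0f;

    int numTiles = (wM + TILE - 1) / TILE;  // round up so a partial tile is still visited
    for (int m = 0; m < numTiles; ++m) {
        int colM = m * TILE + tx;
        int rowN = m * TILE + ty;
        // Each tile is guarded on its own; slots outside the matrix are padded with 0.
        MS[ty][tx] = (rowM < hM && colM < wM) ? M[rowM * wM + colM] : 0.0f;
        NS[ty][tx] = (rowN < wM && colN < wN) ? N[rowN * wN + colN] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += MS[ty][k] * NS[k][tx];
        __syncthreads();
    }

    if (rowM < hM && colN < wN)  // threads that fall in the padding write nothing
        P[rowM * wN + colN] = acc;
}

// Hypothetical host-side check: multiplies a 3x2 by a 2x3 matrix on the GPU
// and returns the number of entries that differ from a CPU reference.
int runPaddedExample()
{
    const int hM = 3, wM = 2, wN = 3;
    float hostM[hM * wM] = {1, 2, 3, 4, 5, 6};
    float hostN[wM * wN] = {1, 2, 3, 4, 5, 6};
    float hostP[hM * wN] = {0};

    float *dM, *dN, *dP;
    cudaMalloc(&dM, sizeof(hostM));
    cudaMalloc(&dN, sizeof(hostN));
    cudaMalloc(&dP, sizeof(hostP));
    cudaMemcpy(dM, hostM, sizeof(hostM), cudaMemcpyHostToDevice);
    cudaMemcpy(dN, hostN, sizeof(hostN), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((wN + TILE - 1) / TILE, (hM + TILE - 1) / TILE);  // round up here too
    nonsquarePadded<<<grid, block>>>(dM, dN, dP, hM, wM, wN);
    cudaMemcpy(hostP, dP, sizeof(hostP), cudaMemcpyDeviceToHost);  // also waits for the kernel

    int mismatches = 0;
    for (int i = 0; i < hM; ++i)
        for (int j = 0; j < wN; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < wM; ++k)
                ref += hostM[i * wM + k] * hostN[k * wN + j];
            if (fabsf(hostP[i * wN + j] - ref) > 1e-5f) ++mismatches;
        }

    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    return mismatches;
}

int main()
{
    int mismatches = runPaddedExample();
    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```

The tiles are guarded separately because a single combined test can zero an element of NS that another thread in the same block still needs, even though the thread that loads it has an out-of-range row in M.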
But, since I seem to be the only one here who advocates using existing tuned libraries: why not use one instead of rolling your own? They are fast, and tested by hundreds of users.
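As an illustration of the library route, here is a minimal sketch using cuBLAS. The choice of cuBLAS is an assumption made for this example, not something the text above specifies. The helper names gemmRowMajor and runCublasExample are hypothetical. cublasSgemm expects column-major storage, so for row-major data the sketch asks for C^T = B^T * A^T by swapping the operands and the dimensions; a row-major matrix read as column-major is its transpose. No padding or block-size restriction is needed on the caller's side. Status codes from the CUDA runtime calls are ignored for brevity.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// C (m x n) = A (m x k) * B (k x n), all row-major device pointers.
// Computed as C^T = B^T * A^T in cuBLAS's column-major convention.
int gemmRowMajor(cublasHandle_t handle, const float *dA, const float *dB,
                 float *dC, int m, int k, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t st = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    n, m, k,
                                    &alpha, dB, n,   // B read as an n x k column-major matrix
                                    dA, k,           // A read as a k x m column-major matrix
                                    &beta, dC, n);   // C read as an n x m column-major matrix
    return st == CUBLAS_STATUS_SUCCESS ? 0 : 1;
}

// Hypothetical check on the 3x2 * 2x3 case; returns the number of entries
// that differ from a CPU reference (or -1 if cuBLAS reports an error).
int runCublasExample()
{
    const int m = 3, k = 2, n = 3;
    float hostA[m * k] = {1, 2, 3, 4, 5, 6};
    float hostB[k * n] = {1, 2, 3, 4, 5, 6};
    float hostC[m * n] = {0};

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hostA));
    cudaMalloc(&dB, sizeof(hostB));
    cudaMalloc(&dC, sizeof(hostC));
    cudaMemcpy(dA, hostA, sizeof(hostA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hostB, sizeof(hostB), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return -1;
    int failed = gemmRowMajor(handle, dA, dB, dC, m, k, n);
    cudaMemcpy(hostC, dC, sizeof(hostC), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);

    int mismatches = 0;
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float ref = 0.0f;
            for (int p = 0; p < k; ++p)
                ref += hostA[i * k + p] * hostB[p * n + j];
            if (fabsf(hostC[i * n + j] - ref) > 1e-5f) ++mismatches;
        }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return failed ? -1 : mismatches;
}

int main()
{
    int result = runCublasExample();
    printf("result: %d\n", result);
    return result != 0;
}
```

This would typically be built with something like nvcc example.cu -lcublas, where the file name is a placeholder.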