I don't understand how the matrix indexing in the program below works.
How is the matrix assigned to the GPU in CUDA?
Is each element of the matrix assigned to its own thread?
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
Without going into exact detail, here is the outline: the <<<numBlocks, threadsPerBlock>>> part of the call MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C) tells CUDA how many threads to create. In this case, N×N threads are generated: numBlocks.x × numBlocks.y blocks, each containing 16×16 threads, one thread per matrix element. Each of those N×N threads runs the kernel code MatAdd(); in other words, MatAdd() is executed independently by N×N threads.
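As a quick sanity check on that count, here is a plain host-side sketch of the arithmetic. The value N = 64 is just an illustrative choice of mine, not from the original post:

#include <cstdio>

int main()
{
    const int N = 64;              // illustrative value, not from the post
    const int bx = 16, by = 16;    // threadsPerBlock = (16, 16)
    int gx = N / bx, gy = N / by;  // numBlocks = (4, 4)
    printf("blocks: %d, threads/block: %d, total: %d (N*N = %d)\n",
           gx * gy, bx * by, gx * gy * bx * by, N * N);
    // prints: blocks: 16, threads/block: 256, total: 4096 (N*N = 4096)
    return 0;
}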
CUDA gives each kernel thread identifying information: blockDim.x contains the number of threads per block along the X axis, blockIdx.x contains the index of the thread block currently running along the X axis, and threadIdx.x contains the index of the current thread within its block. The same goes for the Y axis. Each thread uses these values to determine which array element it should access.
That is exactly what this part does:

int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
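For example, here is a worked case I am adding, using the 16×16 block size from the code above:

// With threadsPerBlock = (16, 16), the thread where
// blockIdx = (1, 0) and threadIdx = (3, 5) computes
//   i = blockIdx.x * blockDim.x + threadIdx.x = 1 * 16 + 3 = 19
//   j = blockIdx.y * blockDim.y + threadIdx.y = 0 * 16 + 5 = 5
// so that one thread alone handles C[19][5] = A[19][5] + B[19][5].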
Therefore, creating more threads than there are array elements can easily cause out-of-bounds accesses. For this reason, be sure to guard every array access with a check against the array size, as the if (i < N && j < N) line above does.
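Below is a minimal runnable sketch of the same pattern. Two choices here are mine, not from the original code: the matrices are flattened into 1D arrays (easier to pass through cudaMalloc than float[N][N]), and numBlocks uses ceiling division so the grid also covers an N that is not a multiple of 16, which is exactly when the bounds check matters:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1000  // deliberately not a multiple of 16

// Same addition kernel, but on flattened 1D arrays.
__global__ void MatAdd(const float *A, const float *B, float *C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)  // bounds check: surplus threads simply do nothing
        C[i * N + j] = A[i * N + j] + B[i * N + j];
}

int main()
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int k = 0; k < N * N; ++k) { hA[k] = 1.0f; hB[k] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(16, 16);
    // Ceiling division: enough blocks to cover all N x N elements.
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC);
    cudaDeviceSynchronize();

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected 3.0)\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}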
You can launch almost any number of threads (there are per-block and per-grid upper limits), but be aware that the hardware actually executes threads in groups of 32 called warps.
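If you want to confirm the warp size and the per-block limit on your own device, a short sketch using the CUDA runtime API (cudaGetDeviceProperties is a real API call; querying device 0 is just an assumption for this example):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("warp size: %d\n", prop.warpSize);  // 32 on current NVIDIA GPUs
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}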