Understanding How Processing Is Assigned to Threads in CUDA Programming

Asked 1 year ago, Updated 1 year ago, 64 views

I don't understand how the matrix indices in the program below work.
How is the matrix assigned to the GPU in CUDA?
Is each element of the matrix assigned to its own thread?

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

c cuda

2022-09-30 19:26

1 Answer

Roughly speaking, the launch MatAdd<<<numBlocks, threadsPerBlock>>> creates one thread for every combination of a block in numBlocks and a thread position in threadsPerBlock. In this case, N × N threads are generated, one per matrix element.

Each of those N × N threads runs the kernel code MatAdd(). In other words, MatAdd() is executed independently by N × N threads.
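As a concrete sanity check, here is a host-side sketch that counts the threads such a launch creates (the value N = 256 is assumed for illustration; it is not given in the question):

#include <cstdio>

int main()
{
    const int N = 256;                      // assumed example size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    int total = numBlocks.x * numBlocks.y
              * threadsPerBlock.x * threadsPerBlock.y;
    printf("%d blocks x %d threads/block = %d threads total\n",
           numBlocks.x * numBlocks.y,
           threadsPerBlock.x * threadsPerBlock.y,
           total);                          // 256 * 256 = 65536
    return 0;
}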

CUDA provides each of these kernel threads with thread-specific information.

blockDim.x contains the number of threads per block along the x axis, blockIdx.x contains the index of the thread block that is currently executing along the x axis, and threadIdx.x contains the index of the currently executing thread within its block. The same goes for the y axis.
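To see these values concretely, here is a minimal sketch (the kernel name ShowIndices and the tiny 2×2 launch are invented for illustration); it relies on device-side printf:

#include <cstdio>

// Each thread prints the thread-specific values CUDA gives it.
__global__ void ShowIndices()
{
    printf("blockIdx=(%d,%d) blockDim=(%d,%d) threadIdx=(%d,%d)\n",
           blockIdx.x, blockIdx.y, blockDim.x, blockDim.y,
           threadIdx.x, threadIdx.y);
}

int main()
{
    dim3 threadsPerBlock(2, 2);  // 4 threads per block
    dim3 numBlocks(2, 2);        // 4 blocks -> 16 threads in total
    ShowIndices<<<numBlocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();     // wait so the output is flushed
    return 0;
}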

Each thread uses these values to work out which array element it should access. That is exactly what these lines do:

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

For example, with 16×16 blocks, the thread at threadIdx = (3, 4) in block blockIdx = (1, 2) computes i = 1·16 + 3 = 19 and j = 2·16 + 4 = 36, so it handles C[19][36].

Therefore, if you generate more threads than there are array elements, out-of-bounds accesses can easily occur. For this reason, always guard the array access with a bounds check against the array size, as the kernel's if (i < N && j < N) does.
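For example, when N is not a multiple of the block size, a common pattern (a sketch, not part of the original post) is to round the number of blocks up and let the bounds check discard the surplus threads; the launch lines in main() would then look like:

    dim3 threadsPerBlock(16, 16);
    // Round up so the grid covers the whole N x N matrix even when
    // N is not a multiple of 16; the extra threads fail the bounds
    // check in MatAdd and simply do nothing.
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);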

You can launch (almost) any number of threads, although there are upper limits, and you should be aware that on the hardware the threads actually execute 32 at a time, in groups called warps.
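As an illustration (the kernel name WhichWarp and the 64-thread launch are made up for this sketch), each thread's warp within a one-dimensional block follows directly from threadIdx.x:

#include <cstdio>

__global__ void WhichWarp()
{
    int warp = threadIdx.x / warpSize;  // warp index within the block
    int lane = threadIdx.x % warpSize;  // position within that warp
    printf("thread %2d -> warp %d, lane %2d\n", threadIdx.x, warp, lane);
}

int main()
{
    WhichWarp<<<1, 64>>>();   // 64 threads = 2 warps of 32
    cudaDeviceSynchronize();
    return 0;
}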


2022-09-30 19:26
