I'm using CUDA dynamic parallelism to launch a kernel from inside another kernel. According to the CUDA documentation, kernels can only be launched up to a fixed recursion depth because of resource constraints.
In my project, however, the parent kernel doesn't need to wait for the child kernel to finish. In other words, they are completely independent.
Is there a way to launch a kernel from within a kernel without being limited by the recursion depth?

I used cudaDeviceSetLimit() to set cudaLimitDevRuntimeSyncDepth, but I still run into restrictions.
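
For reference, here is a minimal host-side sketch of that kind of call. The depth value of 8 is an arbitrary placeholder chosen for illustration, not the value from my project, and the read-back with cudaDeviceGetLimit() is only there to show what the runtime actually accepted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Hypothetical depth, for illustration only.
    const size_t requestedDepth = 8;

    // Ask the runtime to raise the device-runtime sync depth limit.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, requestedDepth);
    if (err != cudaSuccess) {
        std::printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to see the value the runtime is using.
    size_t actualDepth = 0;
    err = cudaDeviceGetLimit(&actualDepth, cudaLimitDevRuntimeSyncDepth);
    if (err != cudaSuccess) {
        std::printf("cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("requested depth %zu, runtime reports %zu\n", requestedDepth, actualDepth);
    return 0;
}
```

The limit has to be set on the host before the first parent kernel is launched.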

Example:

__global__ void do_something(MyQueue* queue, Task* task) {
  // do something ...
  task->execute();

  // If queue is not empty, pop from it and launch a kernel to execute it
  Task* t = queue->pop();
  if (t) {
    do_something<<<t->gridSize, t->blockSize, t->mem, stream>>>(queue, t);
  }
}

GPU: NVIDIA A100

CUDA version: 12.4

Compile flags: -rdc=true -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61
