I'm using CUDA dynamic parallelism to launch a kernel from within a kernel function.
According to the CUDA documentation, kernels can only be launched up to a fixed recursion depth because of resource constraints.
In my project, however, the parent kernel does not need to wait for the child kernel to exit. In other words, the two are completely independent.
Is there a way to launch a kernel from within a kernel function without being limited by the recursion depth?
I tried using cudaDeviceSetLimit() to set cudaLimitDevRuntimeSyncDepth, but the restrictions still apply.
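For reference, here is a minimal sketch of what such a host-side call might look like. The helper name setAndQueryLimit and the depth value 8 are assumptions made for illustration, not the exact code or value from my project. It sets the limit, then reads it back with cudaDeviceGetLimit() so that the value the runtime actually reports can be compared with the requested one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: sets a device limit and reads it back.
// Returns false if either runtime call fails.
static bool setAndQueryLimit(cudaLimit limit, size_t requested, size_t* actual) {
    cudaError_t err = cudaDeviceSetLimit(limit, requested);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    err = cudaDeviceGetLimit(actual, limit);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    // Illustrative value only; not the value used in the actual project.
    const size_t requestedDepth = 8;
    size_t actualDepth = 0;
    // The limit is set before any kernel is launched.
    if (setAndQueryLimit(cudaLimitDevRuntimeSyncDepth, requestedDepth, &actualDepth)) {
        std::printf("requested sync depth %zu, runtime reports %zu\n",
                    requestedDepth, actualDepth);
    }
    return 0;
}
```

Checking the returned cudaError_t matters here, since a rejected or ignored limit would otherwise go unnoticed.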
Example:
    __global__ void do_something(MyQueue* queue, Task* task) {
        // do something ...
        task->execute();
        // If the queue is not empty, pop a task from it and launch a kernel to execute it
        Task* t = queue->pop();
        if (t) {
            do_something<<<t->gridSize, t->blockSize, t->mem, stream>>>(queue, t);
        }
    }
GPU: NVIDIA A100
CUDA version: 12.4
Compile flags: -rdc=true -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61