Questions tagged [dynamic-parallelism]
dynamic parallelism refers to the CUDA capability of launching device kernels from within another device kernel
dynamic-parallelism
52
questions
9
votes
1
answer
9k
views
Compiling .cu files with Dynamic Parallelism (CUDA)
I switched to a new GPU, a GeForce GTX 980 with cc 5.2, so it must support dynamic parallelism. However, I was not able to compile even simple code (from the programming guide). I will not provide it here (...
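For reference, device-side launches need relocatable device code and the device runtime library; a minimal build line for a cc 5.2 card (file name hypothetical):
    nvcc -arch=sm_52 -rdc=true dynpar.cu -o dynpar -lcudadevrt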
7
votes
2
answers
6k
views
CUDA Dynamic Parallelism, bad performance
We are having performance issues when using CUDA Dynamic Parallelism. At this moment, CDP is performing at least 3X slower than a traditional approach.
We made the simplest reproducible code to ...
7
votes
1
answer
618
views
CUDA device runtime API cudaMemsetAsync doesn't work
I am trying to call cudaMemsetAsync from a kernel (so-called "dynamic parallelism"). But no matter what value I use, it always sets the memory to 0.
Here is my test code:
#include "cuda_runtime.h"
#include ...
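A minimal sketch of a device-side memset, assuming a parent kernel that owns a global byte buffer. Note that, like host memset, the value is interpreted per byte, and the call is asynchronous with respect to the launching thread:

    __global__ void parent(unsigned char *buf, size_t n)
    {
        // Sets every byte of buf to 0xFF; an int array written this way reads
        // back as 0xFFFFFFFF per element, not 0xFF.
        cudaMemsetAsync(buf, 0xFF, n, 0);
        // On CDP1 toolkits (before CUDA 12) the parent could call
        // cudaDeviceSynchronize() here to wait for completion.
    }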
7
votes
0
answers
196
views
AleaGPU Dynamic Parallelism in F#? How?
This might be a simple question, but I have not been able to find any references to this topic: How do I launch a kernel from within another kernel? The only relevant example I came across is the ...
3
votes
1
answer
956
views
CUDA dynamic parallelism with Driver API
I'm trying to compile and link a dynamic kernel and use it with the CUDA driver API on a GK110.
I compile the .cu source file in Visual Studio with the relocatable device code flag and compute_35, ...
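For context, a driver-API link that uses the device runtime must add cudadevrt to the JIT link. A rough sketch, assuming PTX built with -rdc=true and compute_35 (paths are placeholders):

    CUlinkState state;
    cuLinkCreate(0, NULL, NULL, &state);
    // The device runtime library has to be part of the device link.
    cuLinkAddFile(state, CU_JIT_INPUT_LIBRARY, "cudadevrt.lib", 0, NULL, NULL);
    cuLinkAddFile(state, CU_JIT_INPUT_PTX, "kernels.ptx", 0, NULL, NULL);
    void *cubin; size_t cubinSize;
    cuLinkComplete(state, &cubin, &cubinSize);
    CUmodule module;
    cuModuleLoadData(&module, cubin);
    cuLinkDestroy(state);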
3
votes
1
answer
189
views
Dynamic Parallelism on GTX 980 ti: Unknown Error
I am attempting dynamic parallelism on a GTX 980 ti card.
All attempts at running the code return "unknown error".
Simple code is shown below with compilation options.
I can execute kernels at depth=0 ...
3
votes
1
answer
371
views
What factors affect the overhead of dynamic parallelism kernel launches?
When you launch a secondary kernel from within a primary one on a GPU, there's some overhead. What are the factors contributing or affecting the amount of this overhead? e.g. size of the kernel code, ...
3
votes
1
answer
952
views
How to perform relational join on two data containers on GPU (preferably CUDA)?
What I'm trying to do:
On the GPU, I'm trying to mimic the conventions used by SQL in relational algebra to perform joins on tables (e.g. Inner Join, Outer Join, Cross Join). In the code below, I'm ...
2
votes
1
answer
2k
views
CUDA - How to make a thread in a kernel wait for its children
I'm trying to implement a really simple merge sort using CUDA's recursive launch capability (cc >= 3.5), but I cannot find a way to tell the parent thread to launch its children concurrently and then wait ...
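A minimal sketch of the usual CDP1 answer (device-side cudaDeviceSynchronize(), which CUDA 12 later removed); kernel names are hypothetical:

    __global__ void childKernel(int *data);

    __global__ void parentKernel(int *data)
    {
        if (threadIdx.x == 0) {
            childKernel<<<4, 256>>>(data);
            // CDP1 only: block this parent thread until its children finish.
            cudaDeviceSynchronize();
        }
        // Make the children's completion visible to the whole parent block.
        __syncthreads();
        // ... merge step can now safely read what the children wrote ...
    }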
2
votes
1
answer
3k
views
Dynamic parallelism - launching many small kernels is very slow
I am trying to use dynamic parallelism to improve an algorithm I have in CUDA. In my original CUDA solution, every thread computes a number that is common for each block. What I want to do is to first ...
2
votes
3
answers
1k
views
Kepler CUDA dynamic parallelism and thread divergence
There is very little information on the dynamic parallelism of Kepler; from the description of this new technology, does it mean that the issue of thread control-flow divergence within the same warp is solved?
...
2
votes
1
answer
519
views
"unknown error" on first cudaMalloc if CUBLAS is present in kernel
I have the following minimal .cu file
#include <cuda_runtime_api.h>
#include <cublas_v2.h>
#include <cstdio>
__global__ void test()
{
    cublasHandle_t handle = nullptr;
...
2
votes
0
answers
203
views
Does nvcc support tail call optimization in dynamic parallelism?
Under the CUDA Programming Guide section C.4.3.1.2. "Nesting and Synchronization Depth", it is mentioned:
"An optimization is permitted where the system detects that it need not reserve space for ...
1
vote
2
answers
1k
views
Understanding Dynamic Parallelism in CUDA
Example of dynamic parallelism:
__global__ void nestedHelloWorld(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n", iDepth, tid, ...
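For reference, a self-contained version of that example (essentially the standard teaching code; build with nvcc -arch=sm_35 -rdc=true -lcudadevrt):

    #include <cstdio>

    __global__ void nestedHelloWorld(int const iSize, int iDepth)
    {
        int tid = threadIdx.x;
        printf("Recursion=%d: Hello World from thread %d block %d\n",
               iDepth, tid, blockIdx.x);
        if (iSize == 1) return;       // stop when only one thread remains
        int nthreads = iSize / 2;     // halve the thread count at each level
        if (tid == 0 && nthreads > 0) {
            nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
            printf("-------> nested execution depth: %d\n", iDepth);
        }
    }

    int main()
    {
        nestedHelloWorld<<<1, 8>>>(8, 0);
        cudaDeviceSynchronize();
        return 0;
    }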
1
vote
1
answer
2k
views
CUDA recursion depth
When using Dynamic Parallelism in CUDA, you can implement recursive algorithms like mergeSort. I have implemented it, and my program doesn't work for inputs greater than blah.
My question is how many ...
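For context, the device runtime limits nesting to a hardware maximum depth of 24, and on CDP1 toolkits the depth at which a parent may still synchronize is a separate, host-configurable limit. A minimal sketch of raising the relevant limits before the first launch:

    // Allow device-side cudaDeviceSynchronize() down to 16 nesting levels (CDP1).
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 16);
    // Reserve room for more outstanding child launches than the default of 2048.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 8192);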
1
vote
2
answers
1k
views
Nested Directives in OpenACC
I'm trying to use the nested feature of OpenACC to activate dynamic parallelism on my GPU card. I have a Tesla K40c, and my OpenACC compiler is PGI version 15.7.
My code is quite simple. When I try to compile ...
1
vote
1
answer
2k
views
Is it possible to call cublas functions from a device function?
Here, Robert Crovella said that cuBLAS routines can be called from device code. Although I am using dynamic parallelism and compiling with compute capability 3.5, I cannot manage to call cuBLAS ...
1
vote
1
answer
1k
views
How to compile a .cu with dynamic parallelism? [closed]
I have 2 .cpp files, setup and functions, and 6 .cu files: main, flood, timestep, discharge, continuity and copy.
I'm trying to compile this so that main calls the .cpp files, and so that the flood kernel (__global__) and ...
1
vote
1
answer
543
views
Why can't I link to my CUDA static library that uses Dynamic Parallelism and Separable Compilation?
I'm trying to create the most basic CUDA application to demonstrate Dynamic Parallelism, Separate Compilation and Linking, a CUDA kernel in a static library, and I'm trying to use CMake to generate a ...
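For reference, with CMake's native CUDA support the two target properties that usually matter here are CUDA_SEPARABLE_COMPILATION and CUDA_RESOLVE_DEVICE_SYMBOLS. A minimal sketch with hypothetical target and file names:

    cmake_minimum_required(VERSION 3.18)
    project(dp_demo LANGUAGES CXX CUDA)
    add_library(kernels STATIC kernels.cu)
    set_target_properties(kernels PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
    add_executable(app main.cu)
    target_link_libraries(app PRIVATE kernels)
    # Run the device link step on the final target so the static library's
    # device symbols (e.g. __cudaRegisterLinkedBinary...) get resolved.
    set_target_properties(app PROPERTIES CUDA_RESOLVE_DEVICE_SYMBOLS ON)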
1
vote
1
answer
945
views
How to call a Thrust function in a stream from a kernel?
I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is a host function that cannot be ...
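One common workaround (behavior varies by Thrust version) is to call the algorithm from a tiny wrapper kernel launched on the desired stream; inside device code, thrust::device may itself use dynamic parallelism when built with -rdc=true, while thrust::seq runs sequentially in the calling thread. A hedged sketch with hypothetical names:

    #include <thrust/scatter.h>
    #include <thrust/execution_policy.h>

    __global__ void scatterKernel(const int *in, const int *map, int *out, int n)
    {
        // The whole scatter is issued from a single device thread.
        thrust::scatter(thrust::device, in, in + n, map, out);
    }

    // Host side: launching the wrapper on a stream makes the scatter
    // asynchronous with respect to the host thread, e.g.
    //   scatterKernel<<<1, 1, 0, stream>>>(d_in, d_map, d_out, n);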
1
vote
1
answer
757
views
CL_OUT_OF_RESOURCES error is returned by clEnqueueNDRangeKernel() with dynamic parallelism
Kernel codes that produce the error:
__kernel void testDynamic(__global int *data)
{
    int id = get_global_id(0);
    atomic_add(&data[1], 2);
}
__kernel void test(__global int *data)
{
    int ...
1
vote
1
answer
1k
views
numba.typeinfer.TypingError: Untyped global name 'child_launch' when using CUDA Dynamic Parallelism in Python (Anaconda) on an NVIDIA GPU
My code is here:
import numpy as np
from numbapro import cuda
@cuda.autojit
def child_launch(data):
    data[cuda.threadIdx.x] = data[cuda.threadIdx.x] + 100
@cuda.autojit
def parent_launch(data):
...
1
vote
0
answers
6
views
CUDA dynamic parallelism -- Is there a way to infinitely nest kernel launches?
I'm using CUDA dynamic parallelism to launch a kernel from within a kernel function.
According to the CUDA documentation, kernel functions can only be launched to a fixed recursion depth because of resource constraints.
...
1
vote
3
answers
3k
views
Generating Relocatable Device Code using Nvidia Nsight
I'm trying to compile a dynamic parallelism example on CUDA, and when I try to compile it I get an error saying:
kernel launch from __device__ or __global__ functions requires separate compilation ...
0
votes
1
answer
907
views
Synchronization in CUDA dynamic parallelism
I am testing dynamic parallelism with the following kernel, the one that gets the maximum value of an integer array using dynamic parallelism in a divide and conquer fashion:
__global__ void getMax(...
0
votes
1
answer
340
views
CUDA dynamic parallelism: Access child kernel results in global memory
I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:
int aPayloads[32];
// Compute aPayloads start values here
...
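For context: children cannot be handed pointers into the parent's local or shared memory, so results have to round-trip through global memory. A minimal sketch of the usual fix (CDP1 assumed for the device-side synchronization):

    __global__ void childKernel(int *payloads);

    __global__ void parentKernel()
    {
        // A local array like `int aPayloads[32];` lives in per-thread local
        // memory and is NOT visible to a child grid; stage it globally instead.
        int *payloads = (int *)malloc(32 * sizeof(int));  // device heap
        // ... compute start values into payloads ...
        childKernel<<<1, 32>>>(payloads);
        cudaDeviceSynchronize();  // CDP1 only: wait before reading results
        // ... consume payloads ...
        free(payloads);
    }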
0
votes
1
answer
536
views
Can a CUDA parent kernel launch a child kernel with more threads than the parent?
I'm trying to learn how to use CUDA Dynamic Parallelism.
I have a simple CUDA kernel that creates some work, then launches new kernels to perform that work. Let's say I launch the parent kernel with ...
0
votes
1
answer
460
views
Synchronizing depth of nested kernels
Let's take the following code, where there is a parent and a child kernel. From said parent kernel we wish to start threadIdx.x child kernels in different streams to maximize parallel throughput. We then ...
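A minimal sketch of the per-thread stream pattern the question describes; device-side streams must be created with cudaStreamNonBlocking, and destroying a stream does not cancel work already launched into it (CDP1 assumed for the synchronization):

    __global__ void childKernel(int i);

    __global__ void parentKernel()
    {
        // One stream per parent thread so the children are not serialized
        // onto the block's NULL stream.
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        childKernel<<<1, 128, 0, s>>>(threadIdx.x);
        cudaStreamDestroy(s);     // safe: the child still runs to completion
        cudaDeviceSynchronize();  // CDP1 only: waits on this block's children
    }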
0
votes
1
answer
1k
views
Dynamic parallelism - passing contents of shared memory to spawned blocks?
While I've been writing CUDA kernels for a while now, I've not used dynamic parallelism (DP) yet. I've come up against a task for which I think it might fit; however, the way I would like to be able ...
0
votes
1
answer
876
views
"device-function-maxrregcount" message while compiling cuda code
I am trying to write code that performs multiple vector dot products inside the kernel. I'm using the cublasSdot function from the cuBLAS library to perform the vector dot products. This is my code:
using ...
0
votes
1
answer
164
views
Accessing CUDA built-in variable in child kernel
I'm trying to use Kepler's Dynamic Parallelism for one of my application. The global index of the thread (in the parent kernel) launching the child kernel is needed in the child kernel. In other words,...
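For context: threadIdx and blockIdx are re-evaluated inside the child grid, so the parent's global index has to be passed down explicitly as a kernel argument. A minimal sketch:

    #include <cstdio>

    __global__ void childKernel(int parentGlobalIdx)
    {
        // threadIdx/blockIdx here describe the CHILD grid; the parent's index
        // is available only because it was passed in.
        printf("child thread %d of parent %d\n", threadIdx.x, parentGlobalIdx);
    }

    __global__ void parentKernel()
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        childKernel<<<1, 4>>>(gid);
    }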
0
votes
1
answer
1k
views
Nvidia Jetson TK1 Development Board - Cuda Compute Capability
I am quite impressed with this development kit. Instead of buying a new CUDA card, which might require a new main board etc., this board seems to provide it all in one.
Its specs say it has CUDA ...
0
votes
2
answers
1k
views
CMake to generate a MSVC CUDA project that targets newer devices
My PC has a GTX 580 (compute capability 2.0).
I want to compile a CUDA source that uses dynamic parallelism, a feature introduced in compute capability 3.5.
I know I will not be able to run the ...
0
votes
1
answer
246
views
CUDA dynamic parallelism is computing sequentially
I need to write an application that computes some matrices from other matrices. In general, it sums outer products of rows of the initial matrix E and multiplies them by some numbers calculated from v and t ...
0
votes
1
answer
2k
views
Why is cudaLaunchCooperativeKernel() returning not permitted?
So I am using a GTX 1050 with compute capability 6.1 and CUDA 11.0. I need to use grid synchronization in my program, so cudaLaunchCooperativeKernel() is needed. I have checked my device query, so ...
0
votes
1
answer
178
views
Nvidia visual profiler not showing cudaMalloc() after kernel launch
I am trying to write a program that runs almost entirely on the GPU (with very little interaction with the host). initKernel is the first kernel that is being launched from the host. I use Dynamic ...
0
votes
1
answer
215
views
CUDA dynamic parallelism linkage error with extern "C"
I'm trying to link my CUDA Kepler's Dynamic Parallelism program as follows:
nvcc -m32 -arch=sm_35 -dc -Xcompiler '-fPIC' DFS_Solving.cu
nvcc -m32 -arch=sm_35 -Xcompiler '-fPIC' -dlink DFS_Solving.o -...
0
votes
1
answer
349
views
Dynamic Parallelism in CUDA not working
I wrote simple code to understand Dynamic Parallelism. From the values being printed, I see that the child kernel has executed correctly, but when I come back to the parent kernel, I see wrong values ...
0
votes
1
answer
1k
views
Parallelize a method from inside a CUDA device function / kernel
I've got an already parallelized CUDA kernel that performs some tasks which require frequent interpolation.
So there's a kernel
__global__ void complexStuff(...)
which calls one or more times this ...
0
votes
1
answer
1k
views
How can I synchronize device-side command queues with host-side queues? clFinish() and markerWithWaitList give an "invalid queue" error
I'm using the OpenCL 2.0 dynamic parallelism feature and have each work-item enqueue another kernel with a single work-item. When the work completion time of the child kernel is high, the parent kernel completes before ...
0
votes
1
answer
389
views
Using shared memory in Dynamic Parallelism CUDA
Question 1:
Do I have to specify the amount of dynamic shared memory to be allocated at the launch of the parent kernel if shared memory is only used by the child kernel?
Question 2:
The following is my ...
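On the first question, the usual understanding is that shared memory is per launch: a child's dynamic shared memory is sized in the child's own launch configuration, and the parent need not reserve anything on its behalf. A minimal sketch:

    __global__ void childKernel(float *data)
    {
        extern __shared__ float tile[];  // sized by the child's launch below
        // ... use tile ...
    }

    __global__ void parentKernel(float *data)
    {
        if (threadIdx.x == 0) {
            // Third launch parameter sets the CHILD's dynamic shared memory.
            childKernel<<<1, 256, 256 * sizeof(float)>>>(data);
        }
    }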
0
votes
1
answer
656
views
cuBLAS not working within a kernel once compiled to cubin using the -G flag with nvcc
I have a CUDA kernel that looks like the following:
#include <cublas_v2.h>
#include <math_constants.h>
#include <stdio.h>
extern "C" {
    __device__ float ONE = 1.0f;
...
0
votes
1
answer
64
views
Do kernel-launched child kernels have the same warp size as host-launched kernels?
When a kernel block is launched from the host, it has a warp size of 32. Is it the same for child kernels launched via dynamic parallelism? My guess would be yes, but I haven't seen it in the docs.
...
0
votes
0
answers
45
views
CUDA Dynamic parallelism kernel synchronization on Tesla T4
I'm using Google Colab GPUs (Tesla T4) to run CUDA code.
I'm in need of a way to make child kernels blocking, so that the parent kernel waits for child kernels to conclude execution before resuming ...
0
votes
1
answer
850
views
Compiling multiple CUDA files (that use dynamic parallelism) and MPI code
I have a bunch of .cu files that use dynamic parallelism (a.cu, b.cu, c.cu.., e.cu, f.cu), and a main.c file that uses MPI to call functions from a.cu on multiple nodes. I'm trying to write a make ...
0
votes
1
answer
303
views
Trouble compiling/running CUDA code involving dynamic parallelism
I am trying to use dynamic parallelism with CUDA, but I cannot go through the compilation step.
I am working on a GPU with Compute Capability 3.5 and the CUDA version 7.5.
Depending on the switches ...
0
votes
1
answer
457
views
CUDA dynamic parallelism: invalid global write when using texture memory [closed]
I seem to have trouble when a kernel called within a kernel (even a recursive call) uses texture memory to get a value.
If the child kernel, say a different one, doesn't use texture memory, everything ...
0
votes
1
answer
313
views
How do I wait for child kernels to finish in a parent kernel before executing the rest of the parent kernel in CUDA dynamic parallelism?
So I need the runParatron children to fully finish before the next iteration of the for loop happens. Based on the results I am getting, I'm pretty sure that's not happening. For example, I have a ...
-1
votes
1
answer
1k
views
Dynamic Parallelism - separate compilation: undefined reference to __cudaRegisterLinkedBinary
Although I have followed appendix C, "Compiling Dynamic Parallelism", of the "CUDA Programming Guide" and the solutions given here, I cannot manage to solve the problem I have. After the compilation and ...
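For reference, the canonical compile/device-link/host-link sequence that resolves these __cudaRegisterLinkedBinary references (file names are placeholders):

    nvcc -arch=sm_35 -rdc=true -c kernels.cu -o kernels.o
    nvcc -arch=sm_35 -dlink kernels.o -o device_link.o -lcudadevrt
    g++ kernels.o device_link.o -o app -L/usr/local/cuda/lib64 -lcudart -lcudadevrt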