Questions tagged [dynamic-parallelism]

Dynamic parallelism refers to a capability in CUDA that allows kernel launches to be performed from within a device kernel.
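
As a minimal sketch of what the tag covers (the kernel names `parentKernel` and `childKernel` are hypothetical, chosen only for illustration), a parent kernel can launch a child grid with the usual `<<<...>>>` syntax from device code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel (hypothetical name): launched from device code, not from the host.
__global__ void childKernel(int parentBlock)
{
    printf("child thread %d launched by parent block %d\n", threadIdx.x, parentBlock);
}

// Parent kernel (hypothetical name): one thread per block launches a child grid.
__global__ void parentKernel()
{
    if (threadIdx.x == 0) {
        childKernel<<<1, 4>>>(blockIdx.x);
    }
}

int main()
{
    parentKernel<<<2, 32>>>();

    // A parent grid is not considered complete until its child grids finish,
    // so this host-side synchronization also waits for the children.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Device-side launches need relocatable device code and the device runtime library, and a GPU of compute capability 3.5 or higher. Assuming the file is saved as `example.cu`, a typical build line is `nvcc -arch=sm_XX -rdc=true example.cu -o example -lcudadevrt`, with `sm_XX` replaced by the target architecture.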

dynamic-parallelism
9 votes
1 answer
9k views

compilation .cu files with Dynamic Parallelism(CUDA)

I switched to a new GPU GeForce GTX 980 with cc 5.2, so it must support dynamic parallelism. However, I was not able to compile even a simple code (from programming guide). I will not provide it here (...
Mikhail Genkin
7 votes
2 answers
6k views

CUDA Dynamic Parallelism, bad performance

We are having performance issues when using the CUDA Dynamic Parallelism. At this moment, CDP is performing at least 3X slower than a traditional approach. We made the simplest reproducible code to ...
Cristobal Navarro
7 votes
1 answer
618 views

CUDA device runtime api cudaMemsetAsync doesn't work

I am trying to call cudaMemsetAsync from a kernel (so-called "dynamic parallelism"). But no matter what value I use, it always sets memory to 0. Here is my test code: #include "cuda_runtime.h" #include ...
Xiang Zhang
  • 2,933
7 votes
0 answers
196 views

AleaGPU Dynamic Parallelism in F#? How?

This might be a simple question, but I have not been able to find any references to this topic: How do I launch a kernel from within another kernel?. The only relevant example I came across is the ...
Alex Gheith
3 votes
1 answer
956 views

CUDA dynamic parallelism with Driver API

I'm trying to compile and link a dynamic kernel and use it with the CUDA driver API on a GK110. I compile the .cu source file in Visual Studio with the relocatable device code flag and compute_35, ...
FHoenig
  • 359
3 votes
1 answer
189 views

Dynamic Parallelism on GTX 980 ti: Unknown Error

I am attempting dynamic parallelism on a GTX 980 ti card. All attempts at running code return "unknown error". Simple code is shown below with compilation options. I can execute kernels at depth=0 ...
AshleyG
  • 31
3 votes
1 answer
371 views

What factors effect the overhead of dynamic parallelism kernel launches?

When you launch a secondary kernel from within a primary one on a GPU, there's some overhead. What are the factors contributing or affecting the amount of this overhead? e.g. size of the kernel code, ...
einpoklum
  • 127k
3 votes
1 answer
952 views

How to perform relational join on two data containers on GPU (preferably CUDA)?

What I'm trying to do: On the GPU, I'm trying to mimic the conventions used by SQL in relational algebra to perform joins on tables (e.g. Inner Join, Outer Join, Cross Join). In the code below, I'm ...
aiwyn
  • 278
2 votes
1 answer
2k views

CUDA - How to make thread in kernel wait for it's children

I'm trying to implement a really simple merge sort using CUDA recursive (for cm > 35) technology, but I cannot find a way to tell the parent thread to launch its children concurrently and then wait ...
Eugênio Fonseca
2 votes
1 answer
3k views

Dynamic parallelism - launching many small kernels is very slow

I am trying to use dynamic parallelism to improve an algorithm I have in CUDA. In my original CUDA solution, every thread computes a number that is common for each block. What I want to do is to first ...
labotsirc
  • 722
2 votes
3 answers
1k views

Kepler CUDA dynamic parallelism and thread divergence

There is very little information on dynamic parallelism of Kepler, from the description of this new technology, does it mean the issue of thread control flow divergence in the same warp is solved? ...
HooYao
  • 554
2 votes
1 answer
519 views

"unknown error" on first cudaMalloc if CUBLAS is present in kernel

I have the following minimal .cu file #include <cuda_runtime_api.h> #include <cublas_v2.h> #include <cstdio> __global__ void test() { cublasHandle_t handle = nullptr; ...
Joe
  • 6,687
2 votes
0 answers
203 views

Does nvcc support tail call optimization in dynamic parallelism?

Under the CUDA Programming Guide section C.4.3.1.2. "Nesting and Synchronization Depth", it is mentioned: "An optimization is permitted where the system detects that it need not reserve space for ...
peteraldaron
1 vote
2 answers
1k views

Understanding Dynamic Parallelism in CUDA

Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,...
John
  • 3,087
1 vote
1 answer
2k views

CUDA recursion depth

When using Dynamic Parallelism in CUDA, you can implement recursive algorithms like mergeSort. I have implemented it and my program doesn't work for inputs greater than blah. My question is how many ...
AmirSojoodi
  • 1,290
1 vote
2 answers
1k views

Nested Directives in OpenACC

I'm trying to use the nested feature of OpenACC to activate dynamic parallelism on my GPU card. I have a Tesla 40c and my OpenACC compiler is PGI version 15.7. My code is very simple. When I try to compile ...
grypp
  • 435
1 vote
1 answer
2k views

Is it possible to call cublas functions from a device function?

In here Robert Crovella said that cublas routines can be called from device code. Although I am using dynamic parallelism and compiling with compute capability 3.5, I cannot manage to call Cublas ...
emartel
  • 49
1 vote
1 answer
1k views

How to compile a .cu with dynamic parallelism? [closed]

I have 2 cpp files setup and functions, 6 .cu files main, flood, timestep, discharge, continuity and copy. I'm trying to compile this to the main call the cpp files and so the flood kernel global and ...
Seffrin
  • 23
1 vote
1 answer
543 views

Why can't I link to my CUDA static library that uses Dynamic Parallelism and Separable Compilation?

I'm trying to create the most basic CUDA application to demonstrate Dynamic Parallelism, Separate Compilation and Linking, a CUDA kernel in a static library, and I'm trying to use CMake to generate a ...
Justin
  • 1,919
1 vote
1 answer
945 views

How to call a Thrust function in a stream from a kernel?

I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is a host function that cannot be ...
heapoverflow
1 vote
1 answer
757 views

CL_OUT_OF_RESOURCES error is returned by clEnqueueNDRangeKernel() with dynamic parallelism

Kernel codes that produce the error: __kernel void testDynamic(__global int *data) { int id=get_global_id(0); atomic_add(&data[1],2); } __kernel void test(__global int * data) { int ...
huseyin tugrul buyukisik
1 vote
1 answer
1k views

numba.typeinfer.TypingError: Untyped global name 'child_launch' when using CUDA Dynamic Parallelism in Python ( Anaconda ) on NVIDIA GPU

My code is here: import numpy as np from numbapro import cuda @cuda.autojit def child_launch(data): data[cuda.threadIdx.x] = data[cuda.threadIdx.x] + 100 @cuda.autojit def parent_launch(data): ...
Ethan Huang
1 vote
0 answers
6 views

CUDA dynamic parallelism -- Is there a way to infinitely nest kernel launches?

Now, I'm using CUDA dynamic parallelism to create the kernel in a kernel function. In the CUDA document, kernel functions can only be launched a fixed recursion depth because of resource constraints. ...
Frostmourne
1 vote
3 answers
3k views

Generating Relocatable Device Code using Nvidia Nsight

I'm trying to compile a dynamic parallelism example on CUDA, and when I try to compile, it gives an error saying, kernel launch from __device__ or __global__ functions requires separate compilation ...
BAdhi
  • 440
0 votes
1 answer
907 views

Synchronization in CUDA dynamic parallelism

I am testing dynamic parallelism with the following kernel, the one that gets the maximum value of an integer array using dynamic parallelism in a divide and conquer fashion: __global__ void getMax(...
Matias Haeussler
0 votes
1 answer
340 views

CUDA dynamic parallelism: Access child kernel results in global memory

I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this: int aPayloads[32]; // Compute aPayloads start values here ...
Silicomancer
  • 8,966
0 votes
1 answer
536 views

Can a CUDA parent kernel launch a child kernel with more threads than the parent?

I'm trying to learn how to use CUDA Dynamic Parallelism. I have a simple CUDA kernel that creates some work, then launches new kernels to perform that work. Let's say I launch the parent kernel with ...
Justin
  • 1,919
0 votes
1 answer
460 views

Synchronizing depth of nested kernels

Let's take the following code where there is a parent and child kernel. From said parent kernel we wish to start threadIdx.x child kernels in different streams to maximize parallel throughput. We then ...
user2255757
0 votes
1 answer
1k views

Dynamic parallelism - passing contents of shared memory to spawned blocks?

While I've been writing CUDA kernels for a while now, I've not used dynamic parallelism (DP) yet. I've come up against a task for which I think it might fit; however, the way I would like to be able ...
einpoklum
  • 127k
0 votes
1 answer
876 views

"device-function-maxrregcount" message while compiling cuda code

I am trying to write code that performs multiple vector dot products inside the kernel. I'm using the cublasSdot function from the cublas library to perform the vector dot product. This is my code: using ...
starrr
  • 1,013
0 votes
1 answer
164 views

Accessing CUDA built-in variable in child kernel

I'm trying to use Kepler's Dynamic Parallelism for one of my applications. The global index of the thread (in the parent kernel) launching the child kernel is needed in the child kernel. In other words,...
user3813674
  • 2,623
0 votes
1 answer
1k views

Nvidia Jetson TK1 Development Board - Cuda Compute Capability

I am quite impressed with this deployment kit. Instead of buying a new CUDA card, which might require a new main board, etc., this card seems to provide everything in one. In its specs it says it has CUDA ...
phoad
  • 1,831
0 votes
2 answers
1k views

CMake to generate a MSVC CUDA project that targets newer devices

My PC has a GTX 580 (compute capability 2.0). I want to compile a CUDA source that uses dynamic parallelism, a feature introduced in compute capability 3.5. I know I will not be able to run the ...
Marco
  • 131
0 votes
1 answer
246 views

CUDA dynamic parallelism is computing sequentially

I need to write an application that computes some matrices from other matrices. In general, it sums outer products of rows of initial matrix E and multiplies it by some numbers calculated from v and t ...
Daniil Tarpanov
0 votes
1 answer
2k views

Why is cudaLaunchCooperativeKernel() returning not permitted?

So I am using GTX 1050 with a compute capability of 6.1 with CUDA 11.0. I need to use grid synchronization in my program so cudaLaunchCooperativeKernel() is needed. I have checked my device query so ...
abhishekpurandare1297
0 votes
1 answer
178 views

Nvidia visual profiler not showing cudaMalloc() after kernel launch

I am trying to write a program that runs almost entirely on the GPU (with very little interaction with the host). initKernel is the first kernel that is being launched from the host. I use Dynamic ...
progammer
  • 2,002
0 votes
1 answer
215 views

cuda dynamic parallelism linkage error extern c

I'm trying to link my CUDA Kepler's Dynamic Parallelism program as follows: nvcc -m32 -arch=sm_35 -dc -Xcompiler '-fPIC' DFS_Solving.cu nvcc -m32 -arch=sm_35 -Xcompiler '-fPIC' -dlink DFS_Solving.o -...
andersonbp
0 votes
1 answer
349 views

Dynamic Parallelism in CUDA not working

I wrote some simple code to understand Dynamic Parallelism. From the values being printed, I see that the child kernel has executed correctly, but when I come back to the parent kernel, I see wrong values ...
Jagannath
0 votes
1 answer
1k views

Parallelize a method from inside a CUDA device function / kernel

I've got an already parallelized CUDA kernel that does some tasks which require frequent interpolation. So there's a kernel __global__ void complexStuff(...) which calls one or more times this ...
user3249755
0 votes
1 answer
1k views

How can I synchronize device-side command queues with host-side queues? clFinish() and markerWithWaitList gives invalid queue error

I'm using OpenCL 2.0 dynamic parallelism feature and have each workitem enqueue another kernel with single workitem. When work completion time of child kernel is high, parent kernel completes before ...
huseyin tugrul buyukisik
0 votes
1 answer
389 views

Using shared memory in Dynamic Parallelism CUDA

Question 1: Do I have to specify the amount of dynamic shared memory to be allocated at the launch of parent kernel if shared memory is only used by child kernel. Question 2: The following is my ...
Aliya Clark
0 votes
1 answer
656 views

Cublas not working within kernel once compiled to cubin using -G flag with nvcc

I have a CUDA kernel that looks like the following: #include <cublas_v2.h> #include <math_constants.h> #include <stdio.h> extern "C" { __device__ float ONE = 1.0f; ...
Bam4d
  • 610
0 votes
1 answer
64 views

Do kernel-launched child kernels have the same warp size as host-launched kernels?

When a kernel block is launched from the host, it has a warp size of 32. Is it the same for child kernels launched via dynamic parallelism? My guess would be yes, but I haven't seen it in the docs. ...
mmdanziger
  • 4,608
0 votes
0 answers
45 views

CUDA Dynamic parallelism kernel synchronization on Tesla T4

I'm using Google Colab GPUs (Tesla T4) to run CUDA code. I need a way to make child kernels blocking, so that the parent kernel waits for child kernels to conclude execution before resuming ...
Francesco Ostidich
0 votes
1 answer
1k views

Can I copy files from Sharepoint to Azure Blob Storage using dynamic file path?

I am building a pipeline to copy files from Sharepoint to Azure Blob Storage at work. After reading some documentation, I was able to create a pipeline that only copies certain files. However, I would ...
CuteeeeRabbit
0 votes
1 answer
850 views

compile multiple cuda files (that have dynamic parallelism) and MPI code

I have a bunch of .cu files that use dynamic parallelism (a.cu, b.cu, c.cu.., e.cu, f.cu), and a main.c file that uses MPI to call functions from a.cu on multiple nodes. I'm trying to write a make ...
user2330963
0 votes
1 answer
303 views

Trouble compiling/running CUDA code involving dynamic parallelism

I am trying to use dynamic parallelism with CUDA, but I cannot go through the compilation step. I am working on a GPU with Compute Capability 3.5 and the CUDA version 7.5. Depending on the switches ...
VincentN
0 votes
1 answer
457 views

CUDA dynamic parallelism: invalid global write when using texture memory [closed]

I seem to have troubles when a kernel call within a kernel (even recursive call) uses texture memory to get a value. If the child kernel, say a different one, doesn't use texture memory, everything ...
salvaS
  • 13
0 votes
1 answer
313 views

How do I wait for child kernels to finish in a parent kernel before executing the rest of the parent kernel in CUDA dynamic parallelism?

So I need the runParatron children to fully finish before the next iteration of the for loop happens. Based on the results I am getting, I'm pretty sure that's not happening. For example, I have a ...
yugi957
-1 votes
1 answer
1k views

Dynamic Parallelism - separate compilation: undefined reference to __cudaRegisterLinkedBinary

Although I have followed appendix C "Compiling Dynamic Parallelism" from "CUDA Programming Guide" and the solutions given here, I cannot manage to solve the problem I have. After the compilation and ...
emartel
  • 49