Frequently Asked Questions about PyCUDA

[[!toc ]]

General Questions

Should I use PyOpenCL or PyCUDA?

Good question. I put together a page that presents arguments that help you decide. The answer will likely depend on your particular situation. In most cases, "it doesn't matter" is probably the correct answer.

Build Questions

How do I make PyCUDA rebuild itself from scratch?

Just delete the build subdirectory created during compilation:

rm -Rf build

Then restart compilation:

python install

I have with gcc 4.1 or older. Help!

Try adding:


to your pycuda/ or $HOME/

It can also help to use DBOOST_NO_INCLASS_MEMBER_INITIALIZATION, i.e. with:


When I run my first code, I get an ImportError?

I.e., if you get an error like this:

ImportError: cannot open
shared object file: No such file or directory

you should take a look at BoostInstallationHowto#LD_LIBRARY_PATH.

If your error looks more like this:

ImportError: /usr/lib/python2.5/site-packages/pycuda-0.93rc4-py2.5-
linux-i686.egg/pycuda/ undefined symbol:

then you're likely mixing the headers of one version of boost with the libraries of another. Another likely indication of this problem is that BOOST_INC_DIR and BOOST_LIB_DIR use different prefixes.

BOOST_INC_DIR = ['/usr/include']
BOOST_LIB_DIR = ['/usr/local/lib']

Often, this will be caused by Boost development packages being installed system-wide whose headers then get picked up by accident.
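One rough way to spot such a mismatch is to compare the installation prefixes of the configured directories. The sketch below is only a heuristic and is not part of PyCUDA: the helper names `install_prefix` and `prefixes_match` are made up for illustration, and it assumes that header directories end in `include` and library directories end in `lib` or `lib64`.

```python
import os.path


def install_prefix(path):
    """Strip a trailing include/lib/lib64 component (hypothetical helper)."""
    head, tail = os.path.split(os.path.normpath(path))
    if tail in ("include", "lib", "lib64"):
        return head
    return os.path.normpath(path)


def prefixes_match(inc_dirs, lib_dirs):
    """Return True if all header and library directories share one prefix."""
    prefixes = {install_prefix(d) for d in list(inc_dirs) + list(lib_dirs)}
    return len(prefixes) <= 1


# The settings from the example above point at two different installations.
BOOST_INC_DIR = ['/usr/include']
BOOST_LIB_DIR = ['/usr/local/lib']

print(prefixes_match(BOOST_INC_DIR, BOOST_LIB_DIR))  # False
```

A result of False does not prove that the build is broken, but it is a good hint that the headers and libraries come from different Boost installations.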

I'm getting funny setuptools errors, like KeyError: '_driver'. Why?

See DistributeVsSetuptools. Likely you're on Python 2.6.3.

Usage Questions

How about multiple GPUs?

Two ways:

  • Allocate two contexts and juggle them (with pycuda.driver.Context.push and pycuda.driver.Context.pop) from a single process.
  • Work with several processes or threads, using MPI, multiprocessing or threading. As of version 0.90.2, PyCUDA will release the [[!wikipedia Global Interpreter Lock desc="Global Interpreter Lock"]] while it is waiting for CUDA operations to finish. As of version 0.93, PyCUDA will actually work when used together with threads. Also see threading, below.

My program terminates after a launch failure. Why?

This should not be an issue any more with 0.93 and later, where cleanup failures have been downgraded to warnings.

You're probably seeing something like this:

Traceback (most recent call last):
  File "", line 32, in <module>
    cuda.memcpy_dtoh(a_doubled, a_gpu)
RuntimeError: cuMemcpyDtoH failed: launch failed
terminate called after throwing an instance of 'std::runtime_error'
  what():  cuMemFree failed: launch failed
zsh: abort      python

What's going on here? First of all, recall that launch failures in CUDA are asynchronous. So the actual traceback does not point to the failed kernel launch, it points to the next CUDA request after the failed kernel.

Next, as far as I can tell, a CUDA context becomes invalid after a launch failure, and all following CUDA calls in that context fail. Now, that includes cleanup (see the cuMemFree in the traceback?) that PyCUDA tries to perform automatically. Here, a bit of PyCUDA's C++ heritage shows through. While performing cleanup, we are processing an exception (the launch failure reported by cuMemcpyDtoH). If another exception occurs during exception processing, C++ gives up and aborts the program with a message.

In principle, this could be handled better. If you're willing to dedicate time to this, I'll likely take your patch.

Are the CUBLAS APIs available via PyCUDA?

No. I would be more than happy to make them available, but that would be mostly either-or with the rest of PyCUDA, because of the following sentence in the CUDA programming guide:

     * [CUDA] is composed of two APIs: 
        * A low-level API called the CUDA driver API, 
        * A higher-level API called the CUDA runtime API that is implemented on top of the CUDA driver API. These APIs are mutually exclusive: An application should use either one or the other.

PyCUDA is based on the driver API. CUBLAS uses the high-level API. One can violate this rule without crashing immediately. But sketchy stuff does happen. Instead, for BLAS-1 operations, PyCUDA comes with a class called [[!PyCudaDoc array desc="GPUArray"]] that essentially reimplements that part of CUBLAS.

If you dig into the history of PyCUDA, you'll find that, at one point, I did have rudimentary CUBLAS wrappers. I removed them because of the above issue. If you would like to make CUBLAS wrappers, feel free to use these rudiments as a starting point. That said, Arno Pähler's python-cuda has complete ctypes-based wrappers for CUBLAS. I don't think they interact natively with numpy, though.

I've found some nice undocumented function in PyCUDA. Can I use it?

Of course you can. But don't come whining if it breaks or goes away in a future release. Being open-source, neither of these two should be show-stoppers anyway, and we welcome fixes for any functionality, documented or not.

The rule is that if something is documented, we will in general make every effort to keep future versions backward compatible with the present interface. If it isn't, there's no such guarantee.

Does PyCUDA automatically activate the right context for the object I'm talking to?

No. It does know which context each object belongs to, and it does implicitly activate contexts for cleanup purposes. Since I'm not entirely sure how costly context activation is supposed to be, PyCUDA will not juggle contexts for you if you're talking to an object from a context that's not currently active. Here's a rule of thumb: As long as you have control over invocation order, you have to manage contexts yourself. Since you mostly don't have control over cleanup, PyCUDA manages contexts for you in this case. To make this transparent to you, the user, PyCUDA will automatically restore the previous context once it's done cleaning up.

How does PyCUDA handle threading?

As of version 0.93, PyCUDA supports threading. There is an example of how this can be done in examples/ in the PyCUDA distribution. (The current git repo does not have examples/ Here is a direct link to a previous version:)

When you use threading in PyCUDA, you should be aware of one peculiarity, though. Contexts in CUDA are a per-thread affair, and as such all contexts associated with a thread, as well as GPU memory, arrays and other resources in those contexts, will be automatically freed when the thread exits. PyCUDA will notice this and will not try to free the corresponding resource; it's already gone, after all.

There is another, less intended consequence, though: If Python's garbage collector finds a PyCUDA object it wishes to dispose of, and PyCUDA, upon trying to free it, determines that the object was allocated outside of the current thread of execution, then that object is quietly leaked. This properly handles the above situation, but it mishandles a situation where:

  • You use reference cycles in a GPU driver thread, necessitating the GC (over just regular reference counts).
  • You require cleanup to be performed before thread exit.
  • You rely on PyCUDA to perform this cleanup.

To entirely avoid the problem, do one of the following:

  • Use multiprocessing instead of threading.

  • Explicitly call free on the objects you want cleaned up.
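To see why reference cycles matter here, the following sketch shows that, in CPython, objects caught in a cycle are not released when the last outside reference goes away. They are released only once the cycle collector runs, and that may happen in whichever thread triggers it. The class `FakeAllocation` is a stand-in invented for this illustration; no PyCUDA objects are involved.

```python
import gc
import weakref


class FakeAllocation:
    """Stand-in for a GPU resource; not a PyCUDA class."""

    def __init__(self):
        self.partner = None


gc.disable()  # keep the collector from running on its own during the demo

a = FakeAllocation()
b = FakeAllocation()
a.partner, b.partner = b, a  # a and b now form a reference cycle
probe = weakref.ref(a)       # lets us observe a without keeping it alive

del a, b
alive_after_del = probe() is not None  # reference counts never reach zero

gc.collect()
alive_after_gc = probe() is not None   # the cycle collector reclaimed both

gc.enable()
print(alive_after_del, alive_after_gc)  # True False
```

With real GPU resources, breaking the cycle yourself or calling free explicitly in the owning thread avoids depending on when and where the collector runs.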

How do I specify the correct types when calling and preparing PyCUDA functions?

When calling a CUDA kernel directly (via __call__) or when "preparing" the function, the following mapping between C data types and numpy types holds:

  • (unsigned) char = numpy.(u)int8
  • (unsigned) short = numpy.(u)int16
  • (unsigned) int = numpy.(u)int32
  • (unsigned) long = numpy.(u)int64 (only 64-bit)
  • float = numpy.float32
  • double = numpy.float64
  • all pointers (e.g. int *, float ***, anything at all) should be numpy.intp. The function implements this mapping.

Handles returned by device memory allocation functions can be cast to numpy.intp. PyCUDA makes no distinction between pointed-to types, so it cannot catch a mismatch: passing a handle to a float * for an argument that expects an int * will result in undefined kernel behavior.

For example, if your kernel looks like this:

kernel = "__global__ void foo( float **a, int b, float c ) {...}"

then your python code for a prepared invocation should resemble:

gpu_source_module = pycuda.compiler.SourceModule(kernel)
gpu_foo = gpu_source_module.get_function("foo")
# Alternatively, one may use the `struct` module's characters for type specification:
gpu_foo.prepare("Pif", block=(nthreads_x,nthreads_x,nthreads_x))

and invoking the kernel should look something like:

gpu_foo.prepared_call( (nblocks_x,nblocks_y), 
  <handle_to_allocated_float**_on_gpu>, 58, 5.923)

Here, the first tuple gives the dimensions of your block grid, and the remaining arguments are your kernel arguments. Observe that there's no need to explicitly cast the arguments to a prepared invocation.
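The format string "Pif" uses the type characters of Python's standard `struct` module. The sketch below does not call PyCUDA at all; it only uses `struct` to show what each character stands for and how Python values are coerced to the corresponding C types. The variable names are chosen for illustration.

```python
import ctypes
import struct

# Format characters for the kernel signature foo(float **a, int b, float c).
arg_format = "Pif"

# Size in bytes of each argument type on this platform.
sizes = {ch: struct.calcsize(ch) for ch in arg_format}
print(sizes)

# "P" is pointer-sized, so it matches the width of void * on this machine.
pointer_size = ctypes.sizeof(ctypes.c_void_p)
print(sizes["P"] == pointer_size)

# Packing and unpacking the two scalar arguments shows the coercion:
# the int stays exact, the float is rounded to single precision.
packed = struct.pack("if", 58, 5.923)
b, c = struct.unpack("if", packed)
print(b, c)
```

The rounding of 5.923 to single precision is the same conversion that numpy.float32(5.923) performs in a direct, unprepared call.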

On the other hand, if you'd like to go with direct (i.e. unprepared) invocation, your call should include explicit casts:

gpu_foo(<handle_to_allocated_float**_on_gpu>, numpy.int32(58), numpy.float32(5.923),
  block=(nthreads_x,nthreads_x,nthreads_x), grid=(nblocks_x,nblocks_y))

Is it possible to use cuda-gdb with PyCUDA?

Yes! As of version 0.94.1, support for this is built right into PyCUDA. This transcript shows what you need to do to debug the script in the PyCUDA examples folder. Note that you need to start Python with the extra switch -m pycuda.debug.

$ cuda-gdb --args python -m pycuda.debug
NVIDIA (R) CUDA Debugger
3.2 release
Portions Copyright (C) 2008-2010 NVIDIA Corporation
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib/".

Now set a breakpoint:

(cuda-gdb) b doublify
Function "doublify" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (doublify) pending.

and run the program:

(cuda-gdb) r
Starting program: /home/andreas/src/env/bin/python -m pycuda.debug
*** compiler output in /tmp/tmpphwQWh
Breakpoint 2 at 0x2cbb1d0: file, line 5.
Pending breakpoint "doublify" resolved
[Launch of CUDA Kernel 0 (doublify) on Device 0]
[Switching to CUDA Kernel 0 (<<<(0,0),(0,0,0)>>>)]

Breakpoint 2, doublify<<<(1,1),(4,4,1)>>> (a=0x100000) at
5             int idx = threadIdx.x + threadIdx.y*4;
Current language:  auto; currently c++

We are now debugging inside the kernel, where we may step, examine data, or set further breakpoints. See the gdb manual and Nvidia's documentation for details.

Note that you cannot debug the host-side code using gdb. I recommend pudb for that job.

Is it possible to profile CUDA code with PyCUDA?

Yes! When you set the environment variable CUDA_PROFILE to 1, CUDA creates log files called cuda_profile_NN.log which contain performance information about the kernels that were run. You must run the Python script in the following way:


After running this code, the current directory will contain a file cuda_profile_0.log listing the names of the kernels that were run and the time each took, both on the CPU and on the GPU:

method=[ md5_rng_float ] gputime=[ 40.096 ] cputime=[ 13.000 ] occupancy=[ 0.500 ]
method=[ fill ] gputime=[ 3.840 ] cputime=[ 9.000 ] occupancy=[ 0.500 ]
method=[ memcpyDtoH ] gputime=[ 19.648 ] cputime=[ 54.000 ]
method=[ prefix_kernel ] gputime=[ 18.144 ] cputime=[ 6.000 ] occupancy=[ 0.667 ]
method=[ prefix_kernel ] gputime=[ 18.720 ] cputime=[ 5.000 ] occupancy=[ 0.667 ]
method=[ fill ] gputime=[ 2.976 ] cputime=[ 6.000 ] occupancy=[ 0.021 ]
method=[ memcpyDtoH ] gputime=[ 1.280 ] cputime=[ 13.000 ]
method=[ memcpyDtoH ] gputime=[ 1.088 ] cputime=[ 12.000 ]
method=[ memcpyDtoH ] gputime=[ 1.056 ] cputime=[ 12.000 ]
method=[ memcpyDtoH ] gputime=[ 1.056 ] cputime=[ 14.000 ]
method=[ memcpyDtoH ] gputime=[ 2.048 ] cputime=[ 15.000 ]
method=[ memcpyDtoH ] gputime=[ 1.088 ] cputime=[ 12.000 ]
method=[ memcpyDtoH ] gputime=[ 1.056 ] cputime=[ 12.000 ]
method=[ memcpyHtoD ] gputime=[ 0.832 ] cputime=[ 3.000 ]
method=[ finish ] gputime=[ 9.600 ] cputime=[ 5.000 ] occupancy=[ 0.667 ]
method=[ memcpyDtoH ] gputime=[ 23.584 ] cputime=[ 76.000 ]
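Since the log is plain text, it is easy to summarize with a few lines of Python. The following is a minimal sketch that assumes every relevant line has the method=[ ... ] gputime=[ ... ] cputime=[ ... ] layout shown above; lines that do not match are skipped. The function name `summarize` and the regular expression are illustrative and not part of PyCUDA or CUDA.

```python
import re
from collections import defaultdict

# Matches the method, gputime and cputime fields of one log line.
LINE_RE = re.compile(
    r"method=\[\s*(\S+)\s*\]\s+"
    r"gputime=\[\s*([\d.]+)\s*\]\s+"
    r"cputime=\[\s*([\d.]+)\s*\]"
)


def summarize(lines):
    """Return {method: [call count, total gputime, total cputime]}."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip anything that is not a method record
        entry = totals[match.group(1)]
        entry[0] += 1
        entry[1] += float(match.group(2))
        entry[2] += float(match.group(3))
    return dict(totals)


# The first few lines of the log shown above.
sample = """\
method=[ md5_rng_float ] gputime=[ 40.096 ] cputime=[ 13.000 ] occupancy=[ 0.500 ]
method=[ fill ] gputime=[ 3.840 ] cputime=[ 9.000 ] occupancy=[ 0.500 ]
method=[ memcpyDtoH ] gputime=[ 19.648 ] cputime=[ 54.000 ]
method=[ prefix_kernel ] gputime=[ 18.144 ] cputime=[ 6.000 ] occupancy=[ 0.667 ]
method=[ prefix_kernel ] gputime=[ 18.720 ] cputime=[ 5.000 ] occupancy=[ 0.667 ]
""".splitlines()

summary = summarize(sample)
for name, (calls, gpu, cpu) in sorted(summary.items()):
    print(f"{name:15s} calls={calls} gputime={gpu:.3f} cputime={cpu:.3f}")
```

To process a real log, pass an open file instead of the sample, for instance summarize(open("cuda_profile_0.log")).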

How do I use the Nvidia Visual Profiler with PyCUDA applications?

In the Visual Profiler, create a new session with File set to your Python executable and Arguments set to your main Python script.

There is one more important change you have to make before the timeline will show any events. Before your application quits, make the call


so that the profiling data buffers get flushed to file.

Does PyCUDA support emulation mode?

No. Also, since Nvidia has declared emulation mode 'deprecated', we have no intention of adding it.

System-specific Questions


My compiler cache gets deleted on every reboot. How do I keep that from happening?

On Debian (and possibly Ubuntu?), edit the file /etc/default/rcS and change


to the number of days that you'd like to keep files in /tmp around. "30" works for me.