CUDA vs OpenCL: Which should I use?

[[!toc ]]

Introduction

If you are looking to get into GPU programming, you are currently faced with an annoying choice:

  • Should I base my work upon OpenCL or CUDA?

I maintain two packages for accelerated computing in Python, PyCUDA and PyOpenCL, so obviously I can't decide either. Still, this is a common question, so this page compiles a number of facts to help you decide. Since the question is broad and difficult as it stands, this page will focus on the Python angle where there is any benefit in doing so.

This is a Wiki page on purpose. If you think you have something to add to this discussion, please do not hesitate to click "Edit" above.

Facts

Vendors

As of right now, there is one vendor of CUDA implementations, namely Nvidia Corporation.

The following vendors have OpenCL implementations available:

  • AMD: OpenCL 2.0 SDK for SSE3-supporting CPUs (both Intel and AMD chips are supported) and AMD GPUs
    • All Radeon 5xxx, 6xxx, 7xxx, and R9 xxx series boards are supported
    • All FirePro W-series (for workstations, active cooling) and FirePro S-series (for servers, passive heatsink) boards are supported
    • All boards with GCN 1.0 or newer are OpenCL 2.1 ready, pending availability of the AMD OpenCL 2.1 SDK
    • CPUs support OpenCL 1.2 only
  • Nvidia
    • Note: Drivers from 280.x onward self-report as supporting OpenCL 1.2. As of September 2016, there is still no timeframe for OpenCL 2.x support.
  • Apple (Mac OS X only)
    • supports the NVIDIA GeForce 8600M GT, GeForce 8800 GT, GeForce 8800 GTS, GeForce 9400M, GeForce 9600M GT, GeForce GT 120, GeForce GT 130, ATI Radeon 4850, Radeon 4870, and likely more
    • supports host CPUs as compute devices
  • Intel: CPUs, GPUs, and "MIC" (Xeon Phi)
  • IBM, for PowerPC. Deprecated since the end of 2012; use POCL instead.
  • Altera: FPGAs (Stratix V). You need to load precompiled binaries, as compilation takes a long time.
  • Xilinx: FPGAs (UltraScale and Virtex-7)
  • POCL (Portable Computing Language, formerly Portable OpenCL), an open-source, LLVM-based CL implementation for CPUs

The following groups are or may be producing CL implementations:

  • The open-source community (Mesa) is working on it: OpenCL support is already part of the Mesa library, and the code can apparently already do some rudimentary things.
  • Many GPUs for the ARM architecture expose OpenCL:
    • ARM Holdings Mali
    • Qualcomm Adreno
    • Imagination Technologies PowerVR
    • Vivante Corporation Vega
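
Given the number of vendors above, it can be useful to check which OpenCL implementations are actually installed on a given machine. Here is a minimal sketch using PyOpenCL; the output of course depends on your installed drivers:

    import pyopencl as cl

    # Each installed vendor implementation shows up as a "platform".
    for platform in cl.get_platforms():
        print(platform.name, "-", platform.version)
        # Each platform exposes one or more compute devices.
        for device in platform.get_devices():
            print("   ", device.name, "| reports:", device.version)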

Other Free components

  • libclc, a library of CL C built-in functions for LLVM

Code Portability

  • While OpenCL can natively talk to a large range of devices, that doesn't mean that your code will run optimally on all of them without any effort on your part. In fact, there's no guarantee it will even run at all, given that different CL devices have very different feature sets. If you stick to the OpenCL spec and avoid vendor-specific extensions, your code should be portable, if not tuned for speed. For now, it is safe to assume that switching devices will require effort on the scale of a rewrite of your kernel code for anything nontrivial. Fortunately, the host-side code stays the same across devices.
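
As a small illustration of guarding against differing feature sets, the sketch below (PyOpenCL; the double-precision extension is just one example of a feature worth checking) tests what a device advertises before relying on it:

    import pyopencl as cl

    ctx = cl.create_some_context()
    device = ctx.devices[0]

    # device.extensions is a space-separated string of extension names.
    if "cl_khr_fp64" in device.extensions.split():
        print(device.name, "supports double precision")
    else:
        print(device.name, "is single-precision only; stick to float kernels")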

Capabilities

  • OpenCL does not appear to support pinned host memory. This may cause a penalty of about a factor of two in host-device transfer rates. (See the pinned-memory sketch after this list.)
    • Note: It looks like pinned host memory does exist in OpenCL, via the flag CL_MEM_ALLOC_HOST_PTR (see section 3.1 in the NVIDIA OpenCL Guide). --JulianBilcke
      • Oh, funny. By its original purpose, CL_MEM_ALLOC_HOST_PTR allocates device memory that is mapped into the host address space (or the other way around?). Pinned host memory doesn't necessarily have a device mapping. But quoting from the guide:
        • "OpenCL applications do not have direct control over whether memory objects are allocated in pinned memory or not, but they can create objects using the CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in pinned memory by the driver for best performance." --AndreasKloeckner
        • Note that allocating a buffer with CL_MEM_ALLOC_HOST_PTR puts restrictions on the amount of memory that can actually be pinned. For example, the Nvidia OpenCL implementation seems to limit this to CL_DEVICE_MAX_MEM_ALLOC_SIZE, while CUDA lets you pin as much physical memory as you can get hold of. --BrianCole
          • That's arguably a bug in the Nvidia implementation, and should probably be reported... -- AndreasKloeckner 2013-03-17 18:49:24
    • How this is done is explained in OpenCL Basics: Flags for creating memory objects. The OpenCL documentation is very unclear here. --VincentHindriksen
  • CUDA's synchronization features are not as flexible as those of OpenCL. In CL, any queued operation (memory transfer, kernel execution) can be told to wait for any other set of queued operations; see the event sketch after this list. CUDA's instruction streams are presently more limited. Further, OpenCL supports synchronization across multiple devices.
    • Partially less true as of CUDA 3.2, with the addition of cu(da)StreamWaitEvent(). CUDA still has no equivalent to CL's out-of-order queues. -- AndreasKloeckner 2010-12-14 05:30:27
  • CUDA has more mature tools, including a debugger and a profiler, as well as libraries such as CUBLAS and CUFFT. If you're a C programmer, the CUDA "runtime API" is easier to use than OpenCL, though somewhat more restricted. CUDA's "driver API" is rather similar to OpenCL.
  • If you're a C++ programmer, note that CUDA is a C API, while OpenCL provides C++ bindings natural to an object-oriented programmer. There's also SYCL, which could become interesting once it's available.
  • CUDA allows C++ constructs (templates, realistically) in GPU code, while OpenCL is based on C99. (With GPU run-time code generation from PyCUDA or PyOpenCL, this is not much of a differentiator.)
    • Note: According to the AMD Accelerated Parallel Processing guide, at least the AMD implementation of OpenCL now supports what they call "static C++ kernels", with templates and compile-time overloading.
  • OpenCL can enqueue regular CPU function pointers in its command queues; CUDA can't.
    • Isn't that what cudaStreamAddCallback does? -- dsign
  • I couldn't find how CUDA's linear-memory-bound 1D textures map into CL. Can anyone shed some light? --AndreasKloeckner
    • Fixed in OpenCL 1.2. -- AndreasKloeckner 2012-10-08 19:02:19
  • OpenCL comes with run-time code generation built in; in CUDA, you have to use tools (such as PyCUDA) to add it. (A small example follows this list.)
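
As referenced in the pinned-memory discussion above, here is a minimal sketch of the CL_MEM_ALLOC_HOST_PTR approach using PyOpenCL. The buffer size is an arbitrary example value, and whether the driver actually pins the memory is, per the quote above, up to the implementation:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    nbytes = 16 * 1024 * 1024  # 16 MiB; an arbitrary example size
    mf = cl.mem_flags

    # ALLOC_HOST_PTR is only a hint: the driver *may* back this with
    # pinned memory.
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, nbytes)

    # Map the buffer into the host address space; host_view is a numpy
    # array viewing the (hopefully pinned) allocation.
    host_view, _evt = cl.enqueue_map_buffer(
        queue, buf, cl.map_flags.WRITE, 0, (nbytes,), np.uint8)
    host_view[:] = 42   # fill on the host side
    del host_view       # PyOpenCL enqueues the unmap once the view is freed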
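
Next, a minimal sketch of the flexible synchronization mentioned above: with PyOpenCL, an operation on one command queue can be told to wait on an event returned by an operation on another queue. The array size is arbitrary:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue_a = cl.CommandQueue(ctx)
    queue_b = cl.CommandQueue(ctx)

    a = np.arange(1024, dtype=np.float32)
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, a.nbytes)

    # The copy on queue_a returns an event...
    evt = cl.enqueue_copy(queue_a, buf, a)

    # ...which an operation on a *different* queue can wait for.
    result = np.empty_like(a)
    cl.enqueue_copy(queue_b, result, buf, wait_for=[evt])
    queue_b.finish()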
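
Finally, the promised example of OpenCL's built-in run-time code generation, via PyOpenCL: the kernel source is an ordinary string, so it can be assembled on the fly (here, with a scale factor known only at run time) and compiled with a single call:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    scale = 2.5  # a value known only at run time
    source = """
    __kernel void scale_vec(__global float *x)
    {
        int i = get_global_id(0);
        x[i] = %f * x[i];
    }
    """ % scale

    prg = cl.Program(ctx, source).build()  # compiled at run time

    x = np.arange(16, dtype=np.float32)
    mf = cl.mem_flags
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=x)
    prg.scale_vec(queue, x.shape, None, buf)
    cl.enqueue_copy(queue, x, buf)
    print(x)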

Speed

  • If you're addressing the same hardware, both frameworks should be able to achieve the same speeds. With early beta drivers this was not always the case, but any such advantage levels out quickly. Some early implementations of special functions aside, equal performance has indeed been found to be the case.

Maintenance

  • It is not likely that either OpenCL or CUDA will disappear in short order, given existing commitments.
  • PyCUDA and PyOpenCL will be maintained for the foreseeable future.

An Attempt at a Conclusion

(Careful: While the above collection is supposed to consist of objective facts, this section is for personal opinion. Feel free to add yours.)

Personally, I would like to see OpenCL succeed. It has the right ingredients as a standard--mainly run-time code generation and reasonable support for heterogeneous computing. On top of that, being in a multi-vendor marketplace is a good thing--also for Nvidia, although they might not immediately see it that way.

If I were starting something new, I would likely go with OpenCL, unless I desperately needed one of the proprietary CUDA libraries. --AndreasKloeckner

If you are on Mac OS X, get started with PyOpenCL, because installing the CUDA framework is painful right now (summer 2010). OpenCL comes bundled with your OS and supports more cards, so getting started is a snap. I agree with Andreas that learning about GPU programming is similar for both frameworks. OpenGL interoperability also helped me, since I already knew some things about OpenGL. --Holger


Adding my own opinion here. I am a game designer from RIT. I have been using OpenCL for the last two months or so, and feel that I have a basic, if not moderate, understanding of it. My boss told me to look into the development environment for CUDA, because OpenCL is so hard to debug and get working properly. The errors sometimes do not even report the actual problem (e.g., "out of resources exception" != "out of bounds exception").

  • (To be fair, the same is true for CUDA--the corresponding error is "invalid context", and you'll have to learn that it means you caused something like a segfault. Also, nowadays AMD offers CPU CL debugging using gdb and on-GPU debugging through gDEBUGGER, so this may no longer be true.) -- AndreasKloeckner 2012-07-23 04:13:02

That being said, I also have to use a separate program to debug syntax in OpenCL. CUDA can be used straight through Visual Studio, and it has IntelliSense. CUDA kernels can also use variables straight out of your code, since they are compiled as part of it; OpenCL kernels are parsed from a string. The CUDA environment is much more user-friendly. OpenCL has more "customizable" options, but this just leads to code refactoring between machines. CUDA seems to port much more consistently, and it's easier to work with development environments with CUDA. Overall, I have done OpenCL for two months and CUDA for two days, and I have had more success with CUDA.

  • (You don't have to stick CL code into a string; you can read it from a separate file, which any sensible IDE will treat as valid C. PyOpenCL ships with a syntax file that highlights CL within a Python file.) -- AndreasKloeckner 2012-07-23 04:13:02

It's currently very difficult to ship binary OpenCL code (which may be necessary for a closed-source company). CUDA allows you to compile to a single binary, with various levels of compatibility guaranteed based upon compiler flags. The OpenCL working group is developing a new specification called SPIR to alleviate this problem (and enable other cool stuff): http://codedivine.org/2012/10/07/opencl-spir/
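
For what it's worth, OpenCL does let you retrieve and reload device-specific binaries, even if that is much weaker than CUDA's compatibility guarantees. A minimal sketch with PyOpenCL (the no-op kernel is just a placeholder):

    import pyopencl as cl

    ctx = cl.create_some_context()

    # Build once, then pull the device-specific binary out of the program.
    prg = cl.Program(ctx, "__kernel void noop(void) { }").build()
    binaries = prg.get_info(cl.program_info.BINARIES)  # one blob per device

    # Later (on the *same* device/driver), recreate the program from the
    # blob, skipping source compilation.
    prg2 = cl.Program(ctx, ctx.devices, binaries).build()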


Other comparisons