CUDA vs OpenCL: Which should I use?
Introduction
If you are looking to get into GPU programming, you are currently faced with an annoying choice: CUDA or OpenCL?
I maintain two packages for accelerated computing in Python, PyCuda and PyOpenCL, so obviously I can't decide either. Still, this is a common question, so this page compiles a number of facts to help you decide. Since the question is broad and difficult as it stands, this page will focus on the Python angle when there is any benefit in doing so.
This is a Wiki page on purpose. If you think you have something to add to this discussion, please do not hesitate to click "Edit" above.
Facts
Vendors
As of right now, there is one vendor of CUDA implementations, namely Nvidia Corporation.
The following vendors have OpenCL implementations available:
ATI for SSE3-supporting CPUs (Intel and AMD chips are supported)
Nvidia (only in beta drivers at the moment)
Apple (MacOS X only)
- supports NVIDIA GeForce 8600M GT, GeForce 8800 GT, GeForce 8800 GTS, GeForce 9400M, GeForce 9600M GT, GeForce GT 120, GeForce GT 130, ATI Radeon 4850, Radeon 4870.
- supports host CPUs as devices
ATI for AMD's GPUs (beta version of the development platform released on October 14)
All Radeon 4x00 and 5x00 series are supported, as well as some FirePro and FireStream cards, including some 4x00 series mobile Radeons.
- The downside is that you actually need to use some beta drivers as this SDK won't work on your currently installed Catalyst.
The following groups are or may be producing CL implementations:
Clover by Zack Rusin for the Gallium3D Linux graphics library
Code Portability
- While OpenCL can natively talk to a large range of devices, that doesn't mean that your code will run optimally on all of them without any effort on your part. In fact, there's no guarantee it will even run at all, given that different CL devices have very different feature sets. For now, it is safe to assume that you are facing efforts on the scale of a rewrite when switching devices for nontrivial codes.
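To make the point about differing feature sets concrete, here is a minimal sketch of checking a device's advertised extensions against what a code needs. The helper names `missing_extensions` and `report_devices` are made up for this illustration, and `cl_khr_fp64` (double precision) is only one example of an optional feature. The PyOpenCL import is kept inside `report_devices` so that the pure-Python part works without a CL platform installed.

```python
def missing_extensions(device_extensions, required):
    """Return the required extension names that a device does not advertise.

    `device_extensions` is the space-separated string OpenCL reports for
    CL_DEVICE_EXTENSIONS (PyOpenCL exposes it as `device.extensions`).
    """
    available = set(device_extensions.split())
    return sorted(name for name in required if name not in available)


def report_devices(required):
    """Print, for every visible device, which required extensions are missing.

    Needs PyOpenCL and at least one installed OpenCL platform, so the
    import is kept local to this function.
    """
    import pyopencl as cl

    for platform in cl.get_platforms():
        for device in platform.get_devices():
            missing = missing_extensions(device.extensions, required)
            status = "ok" if not missing else "missing " + ", ".join(missing)
            print(f"{platform.name} / {device.name}: {status}")
```

On a machine with a working CL installation, `report_devices(["cl_khr_fp64"])` would list each device and whether it advertises double precision. Passing such a check only says the code may run, not that it will run well.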
Capabilities
- OpenCL does not appear to support pinned host memory. This may cause a penalty of about a factor of two in host-device transfer rates.
Note: Well it looks like pinned host memory exists in OpenCL, with the flag CL_MEM_ALLOC_HOST_PTR (see 3.1 in the NVIDIA OpenCL Guide) -- JulianBilcke
- Oh, funny. By its original purpose, CL_MEM_ALLOC_HOST_PTR allocates device memory that is mapped into the host address space (or the other way around?). Pinned host memory doesn't necessarily have a device mapping. But quoting from the guide:
- OpenCL applications do not have direct control over whether memory objects are allocated in pinned memory or not, but they can create objects using the CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in pinned memory by the driver for best performance.
- CUDA's synchronization features are not as flexible as those of OpenCL. In CL, any queued operation (memory transfer, kernel execution) can be told to wait for any other set of queued operations. CUDA's instruction streams are presently more limited. Further, OpenCL supports synchronization across multiple devices.
- CUDA has more mature tools, including a debugger and a profiler, as well as libraries such as CUBLAS and CUFFT. If you're a C programmer, the CUDA "runtime API" is easier to use than OpenCL, though somewhat more restricted. CUDA's "driver API" is rather similar to OpenCL.
PyCuda is somewhat more mature than PyOpenCL at the moment, though I expect this to level out soon. (Even faster if you help.)
- CUDA allows C++ constructs (templates, realistically) in GPU code, whereas OpenCL is based on C99. (With GPU run-time code generation from PyCUDA or PyOpenCL, this is not much of a differentiator.)
- OpenCL can enqueue regular CPU function pointers in its command queues (via clEnqueueNativeKernel); CUDA can't.
CL can write to textures; CUDA can't. (Does this work on Nvidia hardware? --AndreasKloeckner)
I couldn't find how CUDA's wonderfully useful linear-memory-bound 1D textures map into CL. Can anyone shed some light? --AndreasKloeckner
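On the templates point above: a rough sketch of how run-time code generation can stand in for C++ templates. The kernel, the names `AXPY_TEMPLATE`, `make_axpy_source` and `build_axpy`, and the use of Python's `string.Template` are all choices made for this illustration; any text templating approach would do. The source is specialized for an element type in Python and then compiled at run time.

```python
from string import Template

# Kernel source with a placeholder where C++ would use a template parameter.
AXPY_TEMPLATE = Template("""
__kernel void axpy(const ${ctype} a,
                   __global const ${ctype} *x,
                   __global ${ctype} *y)
{
    int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
""")


def make_axpy_source(ctype):
    """Return OpenCL C source for y = a*x + y, specialized to `ctype`."""
    return AXPY_TEMPLATE.substitute(ctype=ctype)


def build_axpy(context, ctype):
    """Compile the specialized kernel at run time with PyOpenCL.

    `context` is an existing pyopencl.Context. The import is local so the
    source generation above can be used without PyOpenCL installed.
    """
    import pyopencl as cl

    return cl.Program(context, make_axpy_source(ctype)).build()
```

Calling `make_axpy_source("float")` and `make_axpy_source("int")` yields two separate kernels from one piece of source, which is roughly what a template instantiation would give you. The same idea carries over to PyCuda by generating CUDA C instead.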
Speed
- If you're addressing the same hardware, both frameworks should be able to achieve the same speeds. With the current beta drivers, this may not be the case, but any advantage should level out quickly.
Jack Pien took a look at the speed of AMD's CPU CL implementation.
Maintenance
- It is not likely that either OpenCL or CUDA will disappear in short order, given existing commitments.
PyCuda and PyOpenCL will be maintained for the foreseeable future.
An Attempt at a Conclusion
(Careful: While the above collection is supposed to consist of objective facts, this section is for personal opinion. Feel free to add yours.)
At the moment, I don't believe there is a "right" and a "wrong" choice. OpenCL and CUDA (and thereby PyOpenCL and PyCuda) both only provide the "mechanics" for GPU access. You'll likely spend much more time learning about the intricacies of the hardware, rather than these mechanics, and the hardware knowledge is easily portable between frameworks. Personally, I would be careful to not put too much weight on the decision for now. --AndreasKloeckner
