CUDA vs OpenCL: Which should I use?
Introduction
If you are looking to get into GPU programming, you are currently faced with an annoying choice: should you base your work on CUDA or on OpenCL?
I maintain two packages for accelerated computing in Python, PyCUDA and PyOpenCL, so obviously I can't decide either. Still, this is a common question, so this page compiles a number of facts to help you decide. Since the question as posed is broad and difficult, this page focuses on the Python angle wherever doing so is of any benefit.
This is a Wiki page on purpose. If you think you have something to add to this discussion, please do not hesitate to click "Edit" above.
Facts
Vendors
As of right now, there is one vendor of CUDA implementations, namely Nvidia Corporation.
The following vendors have OpenCL implementations available:
ATI for SSE3-supporting CPUs (Intel and AMD chips are supported)
- Note: Drivers from 280.x onward self-report as supporting OpenCL 1.1. Drivers released August 10th for desktop and portable.
Apple (Mac OS X only)
- supports NVIDIA GeForce 8600M GT, GeForce 8800 GT, GeForce 8800 GTS, GeForce 9400M, GeForce 9600M GT, GeForce GT 120, GeForce GT 130, ATI Radeon 4850, Radeon 4870, and likely more
- supports host CPUs as compute devices
All Radeon 4xxx, 5xxx, and 6xxx series cards are supported, as well as some FirePro and FireStream cards, including some 4x00-series mobile Radeons.
Intel. CPU-only so far. GPU support likely coming with Ivy Bridge.
The following groups are or may be producing CL implementations:
Clover by Zack Rusin for the Gallium3D Linux graphics library
Code Portability
- While OpenCL can natively talk to a large range of devices, that doesn't mean that your code will run optimally on all of them without any effort on your part. In fact, there's no guarantee it will even run at all, given that different CL devices have very different feature sets. If you stick to the OpenCL spec and avoid vendor-specific extensions, your code should be portable, if not tuned for speed. For now, it is safe to assume that you are facing efforts on the scale of a rewrite when switching devices for nontrivial codes.
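To get a feel for how much feature sets differ between CL devices, it can help to list what each device reports. The following is only a rough sketch, assuming PyOpenCL is installed; the helper name `describe_devices` is made up for illustration, and it simply returns an empty list when PyOpenCL or an OpenCL platform is not available.

```python
def describe_devices():
    """Return one dict per OpenCL device, or [] if none can be queried."""
    try:
        import pyopencl as cl  # assumption: PyOpenCL is installed
    except ImportError:
        return []

    report = []
    try:
        for platform in cl.get_platforms():
            for dev in platform.get_devices():
                report.append({
                    "platform": platform.name,
                    "device": dev.name,
                    "max_work_group_size": dev.max_work_group_size,
                    "local_mem_bytes": dev.local_mem_size,
                    # Extensions are reported as one space-separated string.
                    "extensions": dev.extensions.split(),
                })
    except cl.Error:
        # Raised, for example, when no platform or device is found.
        return []
    return report


if __name__ == "__main__":
    for entry in describe_devices():
        print(entry["platform"], "/", entry["device"])
        print("  max work-group size:", entry["max_work_group_size"])
        print("  local memory (bytes):", entry["local_mem_bytes"])
        print("  extensions:", len(entry["extensions"]))
```

Comparing the output across machines gives a quick impression of which limits and extensions a kernel can and cannot rely on.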
Capabilities
- OpenCL does not appear to support pinned host memory. This may cause a penalty of about a factor of two in host-device transfer rates.
Note: Well it looks like pinned host memory exists in OpenCL, with the flag CL_MEM_ALLOC_HOST_PTR (see 3.1 in the NVIDIA OpenCL Guide) -- JulianBilcke
- Oh, funny. By its original purpose, CL_MEM_ALLOC_HOST_PTR allocates device memory that is mapped into the host address space (or the other way around?). Pinned host memory doesn't necessarily have a device mapping. But quoting from the guide:
- OpenCL applications do not have direct control over whether memory objects are allocated in pinned memory or not, but they can create objects using the CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in pinned memory by the driver for best performance.
- CUDA's synchronization features are not as flexible as those of OpenCL. In CL, any queued operation (memory transfer, kernel execution) can be told to wait for any other set of queued operations. CUDA's instruction streams are presently more limited. Further, OpenCL supports synchronization across multiple devices.
Partially less true as of CUDA 3.2, with the addition of cu(da)StreamWaitEvent(). CUDA still has no equivalent to CL's out-of-order queues. -- AndreasKloeckner 2010-12-14 05:30:27
- CUDA has more mature tools, including a debugger and a profiler, also CUBLAS and CUFFT. If you're a C programmer, the CUDA "runtime API" is easier to use than OpenCL, though somewhat more restricted. CUDA's "driver API" is rather similar to OpenCL.
- CUDA allows C++ constructs (templates, realistically) in GPU code, whereas OpenCL is based on C99. (With GPU run-time code generation from PyCUDA or PyOpenCL, this is not much of a differentiator.)
- OpenCL can enqueue regular CPU function pointers in its command queues; CUDA can't.
I couldn't find how CUDA's linear-memory-bound 1D textures map into CL. Can anyone shed some light? --AndreasKloeckner
- This is less relevant on current-generation hardware that has more caching in all datapaths anyway.
OpenCL comes with run-time code generation built-in. In CUDA, you have to use tools (such as PyCUDA) to add it.
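As a minimal sketch of what run-time code generation means from Python: the kernel source is an ordinary string, so it can be specialized just before it is compiled. The template, the kernel, and the helper name `make_scale_source` below are made up for illustration. The resulting string would then be handed to `pyopencl.Program(ctx, source).build()`; an analogous CUDA C string could go to `pycuda.compiler.SourceModule`.

```python
from string import Template

# Hypothetical kernel template, for illustration only.
_SCALE_TEMPLATE = Template("""
__kernel void scale(__global const ${dtype} *x, __global ${dtype} *y)
{
    int i = get_global_id(0);
    y[i] = ${factor} * x[i];
}
""")


def make_scale_source(dtype="float", factor=2):
    """Return OpenCL C source with the element type and factor baked in."""
    return _SCALE_TEMPLATE.substitute(dtype=dtype, factor=int(factor))


if __name__ == "__main__":
    # Print the specialized source; compiling it needs an OpenCL context.
    print(make_scale_source("float", 3))
```

Because the type and the constant are fixed in the source text, the device compiler sees them as literals. This is also why the lack of C++ templates in OpenCL matters less when kernels are generated from Python.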
Speed
- If you're addressing the same hardware, both frameworks should be able to achieve the same speeds. With the current beta drivers, this may not be the case, but any advantage should level out quickly.
Jack Pien took a look at the speed of AMD's CPU CL implementation.
Maintenance
- It is not likely that either OpenCL or CUDA will disappear in short order, given existing commitments.
PyCUDA and PyOpenCL will be maintained for the foreseeable future.
An Attempt at a Conclusion
(Careful: While the above collection is supposed to consist of objective facts, this section is for personal opinion. Feel free to add yours.)
Personally, I would like to see OpenCL succeed. It has the right ingredients as a standard--mainly run-time code generation and reasonable support of heterogeneous computing. On top of that, being in a multi-vendor marketplace is a good thing--also for Nvidia, although they might not immediately see it that way.
If I were starting something new, I would likely go with OpenCL, unless I desperately needed one of the proprietary CUDA libraries. -- AndreasKloeckner
If you are on Mac OS X, get started with PyOpenCL, because installing the CUDA framework is painful right now (summer 2010). OpenCL comes bundled with your OS and supports more cards, so getting started is a snap. I agree with Andreas that learning about GPU programming is similar for both frameworks. OpenGL interoperability also helped me, since I already knew some OpenGL. -- Holger
Adding my own opinion here. I am a Game Designer from RIT. I have been using OpenCL for the last 2 months or so, and feel that I have a basic understanding of it, if not a moderate view. My boss told me to look into the development environment for CUDA, due to the fact that OpenCL is SOOOO hard to debug and get working properly. The errors sometimes do not even report the actual problem (i.e. "Out of resources exception" != "Out of bounds exception").
That being said, I also have to use a separate program to check syntax in OpenCL. CUDA can be used straight from Visual Studio, and it has IntelliSense. CUDA can also use variables straight out of the host code, because it is compiled as code, whereas OpenCL source is parsed as a string. The CUDA environment is much more user-friendly. OpenCL has more "customizable" options, but this just leads to code refactoring between machines. CUDA seems to port much more consistently, and it's easier to work with development environments with CUDA. Overall, I have done OpenCL for 2 months and CUDA for 2 days, and I have had more success with CUDA.
