A Summary of Peculiarities in Various OpenCL Implementations

[[!toc ]]

Apple

  • Caches compiled binaries keyed on a hash of the source code, without taking included files into account, so a change to an included file alone does not trigger a recompile. Suggested workaround: run "sudo killall cvmsServ" from the command line (or in a script) to clear the cache before compiling.
  • Will not accept the preprocessor option '-DNAME=VALUE' (no space after '-D'). This is in line with the CL standard, but contrary to common practice. Use '-D NAME=VALUE' instead.
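
As a minimal sketch of the '-D NAME=VALUE' form, host code can assemble the build options string with a small helper. The helper name `build_options` and the macro names below are hypothetical, chosen only for illustration; the resulting string is what would be passed as the options argument when building the program.

```python
def build_options(defines):
    """Format preprocessor defines in the spaced '-D NAME=VALUE' form.

    `defines` maps macro names to values; a value of None produces a
    bare '-D NAME' with no '=VALUE' part.
    """
    parts = []
    for name, value in defines.items():
        if value is None:
            parts.append(f"-D {name}")
        else:
            parts.append(f"-D {name}={value}")
    return " ".join(parts)


# Hypothetical macro names, for illustration only.
print(build_options({"BLOCK_SIZE": 16, "USE_FAST_PATH": None}))
# prints: -D BLOCK_SIZE=16 -D USE_FAST_PATH
```

The spaced form is also accepted by implementations that tolerate '-DNAME=VALUE', so it should be safe to use everywhere.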

Apple, Compiling outside Xcode

Change the compiler LDFLAGS from "-lOpenCL" to "-framework OpenCL".
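
A build script that has to work on several platforms could select the flag at run time. This is only a sketch: the function name `opencl_ldflags` is hypothetical, and it assumes a GCC-style linker command line on non-Apple systems.

```python
import sys


def opencl_ldflags(platform=sys.platform):
    """Return linker flags for OpenCL on the given platform.

    Assumption: anything other than macOS ("darwin") links against a
    conventional libOpenCL via -lOpenCL.
    """
    if platform == "darwin":
        return ["-framework", "OpenCL"]
    return ["-lOpenCL"]


print(" ".join(opencl_ldflags()))
```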

Apple, CPU

Only allows one work item per work group (each work group maps to one CPU thread).

AMD, CPU

Unlike Apple's CPU implementation, AMD does allow multiple work items in a work group on the CPU. That mapping does not appear to be particularly efficient, but details aren't yet known.

AMD, 4xxx-generation GPUs

If barrier() is used, work group sizes cannot exceed 64 items.
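
Limits like this one, and the one-item limit on Apple's CPU implementation, mean host code should not hard-code a work group size. The sketch below picks the largest size that stays under a given cap and evenly divides the global size (OpenCL 1.x requires the global size to be a multiple of the work group size). The function name and the two constants are hypothetical and simply restate the limits listed on this page; in a real program the cap would more reliably come from querying CL_KERNEL_WORK_GROUP_SIZE through clGetKernelWorkGroupInfo.

```python
# Caps taken from the notes on this page (illustrative constants).
APPLE_CPU_CAP = 1           # one work item per work group
AMD_4XXX_BARRIER_CAP = 64   # kernels that call barrier()


def pick_local_size(global_size, cap):
    """Largest work group size <= cap that evenly divides global_size."""
    if global_size < 1 or cap < 1:
        raise ValueError("global_size and cap must be positive")
    for size in range(min(cap, global_size), 0, -1):
        if global_size % size == 0:
            return size


print(pick_local_size(1024, AMD_4XXX_BARRIER_CAP))  # prints 64
print(pick_local_size(1000, AMD_4XXX_BARRIER_CAP))  # prints 50
print(pick_local_size(1000, APPLE_CPU_CAP))         # prints 1
```

The loop always terminates with a result because 1 divides every global size.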

Nvidia, GPU

The hardware can bind samplers to linear chunks of memory to enable an extra layer of caching, but this functionality is not available from OpenCL. (This is less relevant on Fermi-class chips, which have more caches in all data paths. OpenCL 1.2 addresses the issue, but Nvidia GPUs don't seem to support that.)

Similarly to Apple, caches compiled binaries based on a hash of the source code, not taking into account included files. (This may only apply to older driver versions.)
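
One possible way to sidestep this kind of cache without clearing it by hand is to make the source text itself change whenever an included file changes. The sketch below is not taken from any vendor documentation: it assumes the cache key covers the full source string, comments included, and the function name `salted_source` is hypothetical. It appends a comment carrying a digest of the included files, so line numbers in compiler diagnostics are unaffected.

```python
import hashlib


def salted_source(source, include_paths):
    """Append a comment containing a digest of the included files.

    `include_paths` lists the files the kernel source pulls in via
    #include; the caller is responsible for keeping it complete.
    """
    digest = hashlib.sha256()
    for path in sorted(include_paths):
        with open(path, "rb") as f:
            digest.update(f.read())
    return f"{source}\n// include-hash: {digest.hexdigest()}\n"


# Usage (hypothetical file names):
#   src = salted_source(open("kernel.cl").read(), ["common.h"])
#   ...then pass `src` to the OpenCL program-creation call as usual.
```

If the assumption about the cache key does not hold for a given driver, this has no effect and the cache still has to be cleared manually.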