A Summary of Peculiarities in Various OpenCL Implementations
Apple, CPU
Allows only one work item per work group (mapping to one thread per CPU).
AMD, CPU
Unlike Apple's CPU implementation, AMD's allows multiple work items per work group on the CPU. That mapping does not appear to be particularly efficient, but details are not yet known.
AMD, 4xxx-generation GPUs
If a kernel uses barrier(), its work group size cannot exceed 64 work items.
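Because the usable work group size differs this much between implementations (1 on Apple's CPU device, 64 on these AMD GPUs once barrier() is involved), host code is probably better off computing the local size from a queried limit than hard-coding it. The following is a minimal sketch of that idea in plain Python, not part of any OpenCL binding. The helper name pick_local_size is hypothetical, and the limits passed in are taken from the notes above rather than queried from a device. In a real program the limit would typically come from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE. The sketch assumes a one-dimensional launch and relies on the OpenCL 1.x rule that the global size must be a multiple of the local size.

```python
def pick_local_size(global_size, max_work_group_size):
    """Return the largest divisor of global_size that does not exceed
    max_work_group_size (hypothetical helper, 1D launches only)."""
    if global_size < 1 or max_work_group_size < 1:
        raise ValueError("sizes must be positive")
    # Walk downwards from the largest admissible size; 1 always divides,
    # so the loop is guaranteed to return.
    for candidate in range(min(global_size, max_work_group_size), 0, -1):
        if global_size % candidate == 0:
            return candidate


if __name__ == "__main__":
    # Limits below are assumptions copied from the notes, not device queries.
    print(pick_local_size(1024, 1))    # Apple CPU: one work item per group
    print(pick_local_size(1024, 64))   # AMD 4xxx GPU, kernel uses barrier()
    print(pick_local_size(1000, 64))   # global size not a multiple of 64
```

With a limit of 1 the helper degenerates to one work item per group, so the same launch code should work on all of the implementations listed here, at the cost of possibly picking a small local size when the global size has few suitable divisors.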
Nvidia, GPU
The hardware is capable of binding samplers to linear chunks of memory to enable an extra layer of caching. This functionality is not available from OpenCL. (Note that this is less relevant on Fermi-class chips, which have more caches in all data paths.)
