* remove overloads with an explicit buffer; BufferPool is now used instead
* add async versions of all reduce functions
-DBUILD_CUDA_STUBS=ON