Also, added the missing include guard to perf_precomp.hpp. This should fix the build.
Add documentation.
Most of the code is ported from AMD's Bolt library. Four methods are implemented:

    SORT_BITONIC,   // only supports power-of-2 buffer sizes
    SORT_SELECTION, // cannot sort duplicate keys
    SORT_MERGE,
    SORT_RADIX      // only supports signed int/float keys
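
To illustrate why the bitonic method is limited to power-of-2 buffer sizes, here is a minimal CPU sketch of a bitonic sort by key. It is not the ported OpenCL kernel. The names `bitonicSortByKey` and `isPowerOfTwo` are hypothetical and chosen only for this example, and the key/value types (`int`/`float`) are an assumption. The compare-exchange partner is computed as `i ^ j`, which only stays inside the buffer when the size is a power of two.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper (not from the ported code): true when n is a power of two.
static bool isPowerOfTwo(std::size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// CPU reference sketch of an ascending bitonic sort by key.
// Values are permuted together with their keys. The sort is not stable.
void bitonicSortByKey(std::vector<int>& keys, std::vector<float>& vals)
{
    const std::size_t n = keys.size();
    if (vals.size() != n)
        throw std::invalid_argument("keys and values must have the same size");
    if (n < 2)
        return; // nothing to sort
    if (!isPowerOfTwo(n))
        throw std::invalid_argument("bitonic sort needs a power-of-2 size");

    // k: size of the bitonic sequences being merged in this stage.
    for (std::size_t k = 2; k <= n; k <<= 1)
    {
        // j: distance between the two elements of a compare-exchange pair.
        for (std::size_t j = k >> 1; j > 0; j >>= 1)
        {
            for (std::size_t i = 0; i < n; ++i)
            {
                const std::size_t partner = i ^ j;
                if (partner <= i)
                    continue; // handle each pair once

                // Blocks alternate direction so that merging yields bitonic runs.
                const bool ascending = (i & k) == 0;
                const bool outOfOrder = ascending ? keys[i] > keys[partner]
                                                  : keys[i] < keys[partner];
                if (outOfOrder)
                {
                    std::swap(keys[i], keys[partner]);
                    std::swap(vals[i], vals[partner]);
                }
            }
        }
    }
}
```

In a GPU version, the innermost loop over `i` is typically what each work-item executes in parallel. A caller with a non-power-of-2 buffer would have to pad it or pick one of the other methods.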