A C++ CUDA kernel that transposes a 2D array is required. The kernel should be state of the art, meaning more than a plain element-by-element copy: for example, where it helps, global memory accesses should be coalesced for both reads and writes. The main goal is speed. The target GPU is an RTX 2080. The input is a 2D array (an image) of float values. In most cases the sizes in X and Y are unequal, are not multiples of 256, and vary between inputs.
A usage example must also be provided, e.g. loading an image with the NVIDIA SDK, transposing it, and saving the result.
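
To make the requirements concrete, here is a minimal sketch of one common approach, a shared-memory tiled transpose, and not a tuned or definitive implementation. Each block stages a 32x32 tile in shared memory so that both the global read and the global write run along rows, and bounds checks handle sizes that are not multiples of the tile edge. All names (`transposeTiled`, `TILE_DIM`, `BLOCK_ROWS`, `divUp`, `transposeReference`, `CUDA_CHECK`) are hypothetical, and the 32x8 block shape is an assumption taken from common practice, not a value measured on an RTX 2080. File I/O is replaced by a synthetic image to keep the sketch self-contained; the PGM helpers in the CUDA samples' `helper_image.h` (`sdkLoadPGM`, `sdkSavePGM`) could presumably take its place.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE_DIM   = 32;  // tile edge (assumed); one warp covers one tile row
constexpr int BLOCK_ROWS = 8;   // block is 32x8, so each thread copies 4 elements

// Abort on any CUDA runtime error (hypothetical helper for this sketch).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Ceiling division, so partial tiles at the image border still get a block.
inline int divUp(int a, int b) { return (a + b - 1) / b; }

// in:  height rows x width columns, row-major
// out: width rows x height columns, row-major
__global__ void transposeTiled(float* __restrict__ out,
                               const float* __restrict__ in,
                               int width, int height)
{
    // The extra column of padding avoids shared-memory bank conflicts
    // when the tile is read column-wise below.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;  // input column
    int y = blockIdx.y * TILE_DIM + threadIdx.y;  // input row
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && (y + j) < height)        // guard partial border tiles
            tile[threadIdx.y + j][threadIdx.x] = in[(size_t)(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;      // output column
    y = blockIdx.x * TILE_DIM + threadIdx.y;      // output row
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && (y + j) < width)
            out[(size_t)(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}

// Plain CPU transpose, used only to check the GPU result.
void transposeReference(std::vector<float>& out, const std::vector<float>& in,
                        int width, int height)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out[(size_t)x * height + y] = in[(size_t)y * width + x];
}

int main()
{
    const int width = 1000, height = 701;  // unequal, not multiples of 32 or 256
    const size_t count = (size_t)width * height;
    const size_t bytes = count * sizeof(float);

    // Synthetic stand-in for a loaded image.
    std::vector<float> hIn(count), hOut(count), hRef(count);
    for (size_t i = 0; i < count; ++i) hIn[i] = (float)(i % 1013);

    float *dIn = nullptr, *dOut = nullptr;
    CUDA_CHECK(cudaMalloc(&dIn, bytes));
    CUDA_CHECK(cudaMalloc(&dOut, bytes));
    CUDA_CHECK(cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice));

    dim3 block(TILE_DIM, BLOCK_ROWS);
    dim3 grid(divUp(width, TILE_DIM), divUp(height, TILE_DIM));
    transposeTiled<<<grid, block>>>(dOut, dIn, width, height);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost));

    transposeReference(hRef, hIn, width, height);
    const bool ok = (hOut == hRef);  // exact match expected: values are only copied
    printf("%s\n", ok ? "transpose matches CPU reference" : "MISMATCH");

    CUDA_CHECK(cudaFree(dIn));
    CUDA_CHECK(cudaFree(dOut));
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Built with something like `nvcc -O3 -arch=sm_75 transpose.cu` (the RTX 2080 is a compute capability 7.5 device), the program exits with a nonzero status if the GPU result differs from the CPU reference. Whether this variant is actually the fastest option on the target card would need to be confirmed by profiling against alternatives.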