
If you don't consider how memory is accessed (coalescing all the accesses in a warp), your memory accesses can be up to a factor of the warp size slower (32x on NVIDIA hardware). So basically, you can't write even a decently efficient kernel without taking this into account. And if you ignore the shared memory available to you, the penalty can be much larger still.
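
A minimal sketch of the difference (kernel names and the 2.0f workload are just illustrative):

    // Coalesced: consecutive threads in a warp read consecutive
    // 4-byte elements, so the hardware can service the whole warp
    // with one (or a few) memory transactions.
    __global__ void coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    // Strided: consecutive threads touch addresses 32 floats apart,
    // so each thread's load lands in a different memory segment and
    // the warp can need up to 32 separate transactions.
    __global__ void strided(const float *in, float *out, int n) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
        if (i < n) out[i] = in[i] * 2.0f;
    }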

And these aren't advanced concepts; they're just the fundamental programming model for GPUs. It's on the level of "access memory in predictable patterns, ideally sequentially" for CPUs. Everyone knows by now that CPUs like arrays and sequential access. Likewise, GPUs like interleaved access, ideally in tiles no larger than your shared memory, as in the sketch below.
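
A sketch of the standard tiled-transpose idiom, which stages a tile in shared memory so both the global reads and writes stay coalesced (assumes blockDim = 32x32; names are illustrative):

    #define TILE 32

    __global__ void transpose(const float *in, float *out, int w, int h) {
        // +1 column of padding avoids shared-memory bank conflicts.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < w && y < h)
            tile[threadIdx.y][threadIdx.x] = in[y * w + x];  // coalesced read

        __syncthreads();

        // Swap block coordinates so the write is also coalesced;
        // the "interleaving" happens entirely in on-chip memory.
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < h && y < w)
            out[y * h + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }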


