Cache coherence across accelerators lets you present something close to a single unified memory, similar to Apple's unified memory on Apple Silicon. You can share pointers freely between CPU and GPU without orchestrating data movement yourself and still get reasonable efficiency. This matters less for AI workloads, whose large, predictable bulk transfers are already well served by plain DMA. But hardware coherence along the lines of CXL is effectively a prerequisite for a viable implementation of fine-grained shared virtual memory (SVM), the OpenCL 2.x feature that lets host and device touch the same allocation concurrently, on discrete devices.
Here is a relevant paper: https://dl.acm.org/doi/abs/10.1145/3529336.3530817