Besides being too slow to run on current phones and not natively supporting 3D acceleration, these are the reasons mentioned in the article:
"With respect to shell development (Unity), three major shortcomings of the X stack prevent us from delivering the user experience (f’n’f) we have in mind:
* X shares a lot of system state across process boundaries. This is obviously not a problem in itself but a system-level UI that is meant to provide a beautiful and consistent user experience is likely to require tight control over the overall system state.
* X's input model is complex and allows applications to spoof on input events they do not own. On the one hand, this raises serious security concerns, especially regarding mobile platforms. On the other hand, adjusting and extending X's input model is difficult and supporting features like input event batching and compression, motion event prediction together with associated power-saving strategies or flexible synchronization schemes for aligning input event delivery and rendering operations is (too) complex.
* The compositor hierarchy ends on the session level, and no tight integration into the system from boot time onward is available. For that reason, there is a visible glitch when transitioning the system from a VT-level to the graphical shell level."
Also, it says something when #3 on your big list of complaints about X is that switching graphics modes when starting X is a major problem that needs to be fixed.