I don't know what you mean by amazing for language. Almost everything is built on transformers nowadays. Image segmentation uses transformers. Text to speech uses transformers. Voice recognition uses transformers. There are robotics transformers that take image inputs and output motion sequences. Transformers are inherently multi-modal. They handle whatever you throw at them, it's just that language tends to be a very common input or output.