WordNet (as you probably know) is a database that groups English words into sets of synonyms (synsets). If you think of WordNet as a clustering of words into high-level classes, then you could argue that ImageNet is the "WordNet for vision", that is, a clustering of object classes.
The article uses ImageNet in a different sense, namely as a pretraining task for learning representations that are likely to be beneficial for many other tasks in the problem space. In this sense, you could use WordNet as an "ImageNet for language", e.g. by learning word representations from the WordNet definitions. People have done this, but there are much more effective approaches.
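To make the "representations from definitions" idea concrete, here is a very rough sketch. It uses plain bag-of-words vectors rather than anything learned, and the glosses in `GLOSSES` are toy definitions I made up for illustration, not actual WordNet text. The helper names `gloss_vector` and `cosine` are likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy glosses, written for illustration; not actual WordNet text.
GLOSSES = {
    "dog": "a domesticated animal that barks and is kept as a pet",
    "cat": "a small domesticated animal that purrs and is kept as a pet",
    "car": "a road vehicle with four wheels and an engine",
}


def gloss_vector(word):
    """Represent a word by the token counts of its definition."""
    return Counter(GLOSSES[word].lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


dog, cat, car = (gloss_vector(w) for w in ("dog", "cat", "car"))
print(cosine(dog, cat) > cosine(dog, car))
```

Words with overlapping definitions end up with similar vectors, which is the basic intuition. Approaches that actually learn embeddings from large text corpora tend to work much better than this.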
I hope this helped and was not too convoluted.