As I posted in a comment on the article, he seems to have unrealistic expectations for an x86 assembler. TI's DSPs have a nice assembly because of their architecture. The x86 architecture is too complicated and the implementations too diverse to have an assembler like he wants.
I'd put it another way: on VLIW DSPs they HAD to provide a good assembler, or they would be too difficult to program. From what I remember, on some of them you have to manage the pipeline yourself -- e.g. a jump instruction needs to occur several instructions BEFORE the actual jump, because by the time that instruction gets executed, several instructions after it will already have been decoded. You really don't want to do that manually.
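To make the delayed jump concrete, here is a rough sketch of a toy machine, not a model of any real DSP. The instruction set, the two delay slots, and the names `run_delayed` and `DELAY_SLOTS` are all made up for illustration. The point is only that the instructions placed right after the jump still execute before control transfers.

```c
#include <assert.h>

/* Hypothetical toy ISA: add a constant, jump to an index, or halt. */
enum op { OP_ADD, OP_JMP, OP_HALT };
struct insn { enum op op; int arg; };

/* Assumed number of instructions already in the pipeline behind a jump. */
#define DELAY_SLOTS 2

/* Runs the program and returns the accumulator.
   A jump takes effect only after DELAY_SLOTS further instructions have run.
   Jumps placed inside delay slots are not handled in this sketch. */
int run_delayed(const struct insn *prog, int n)
{
    int acc = 0, pc = 0;
    int pending_target = -1;   /* -1 means no jump in flight */
    int countdown = 0;

    assert(prog != 0 && n >= 0);
    while (pc < n) {
        struct insn i = prog[pc];
        int next = pc + 1;

        if (i.op == OP_HALT)
            break;
        if (i.op == OP_ADD)
            acc += i.arg;

        /* An earlier jump lands once its delay slots are used up. */
        if (pending_target >= 0 && --countdown == 0) {
            next = pending_target;
            pending_target = -1;
        }
        /* A new jump is only recorded here; it does not redirect yet. */
        if (i.op == OP_JMP) {
            pending_target = i.arg;
            countdown = DELAY_SLOTS;
        }
        pc = next;
    }
    return acc;
}
```

With the program ADD 1, JMP 5, ADD 10, ADD 100, ADD 1000, HALT, the two adds after the jump still run and only ADD 1000 is skipped, so the programmer (or the assembler) has to find useful work for those slots or pad them out.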
The only non-DSP processor I know that required programmers to schedule jumps ahead of time manually was the Intel i860, and it WAS too difficult to program, which is why it disappeared.
With regard to PSRLB, PSRLW, PMADDUBSW, and PMOVMSKB, I must say I loved assembly much more in the times when each instruction mnemonic was only 2-3 characters.
Oh, the real fun begins with SSE4.2 and things like PCMPISTRM (the whole PCMPxSTRx family), where you not only have the mnemonics, but also an 8-bit immediate with each bit specifying a different aspect of operation for the string comparison engine.
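For anyone who hasn't met that immediate, here is a small sketch of how the byte is put together. The bit layout is as I recall it from the Intel manual (bits 1:0 element format, bits 3:2 aggregation operation, bits 5:4 polarity, bit 6 output selection, bit 7 unused), so check it against the manual before relying on it. The enum and function names are my own, not anything an assembler or compiler provides.

```c
#include <assert.h>

/* Hypothetical names for the fields of the PCMPxSTRx control byte. */
enum src_format  { FMT_UBYTE = 0, FMT_UWORD = 1,
                   FMT_SBYTE = 2, FMT_SWORD = 3 };           /* bits 1:0 */
enum aggregation { AGG_EQUAL_ANY = 0, AGG_RANGES = 1,
                   AGG_EQUAL_EACH = 2, AGG_EQUAL_ORDERED = 3 }; /* bits 3:2 */
enum polarity    { POL_POSITIVE = 0, POL_NEGATIVE = 1,
                   POL_MASKED_POSITIVE = 2,
                   POL_MASKED_NEGATIVE = 3 };                /* bits 5:4 */
/* Bit 6: lowest/highest index for the ...STRI forms,
   bit mask/element mask for the ...STRM forms. */
enum output_sel  { OUT_LOW = 0, OUT_HIGH = 1 };

/* Packs the four fields into the 8-bit immediate; bit 7 stays zero. */
unsigned pcmpstr_imm8(enum src_format f, enum aggregation a,
                      enum polarity p, enum output_sel o)
{
    assert((unsigned)f <= 3u && (unsigned)a <= 3u);
    assert((unsigned)p <= 3u && (unsigned)o <= 1u);
    return (unsigned)f
         | ((unsigned)a << 2)
         | ((unsigned)p << 4)
         | ((unsigned)o << 6);
}
```

For example, pcmpstr_imm8(FMT_UBYTE, AGG_EQUAL_ORDERED, POL_POSITIVE, OUT_LOW) gives 0x0C, the substring-search mode on unsigned bytes. If I have the layout right, the same values come out of OR-ing the `_SIDD_*` constants that compilers ship alongside the SSE4.2 intrinsics.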