It's analog though. Presumably the shape of the image matters (horizontal lines would be easier than vertical ones), so it's not just a bitmap. He made a point of how many KB you can store in the song, but is that right? There are different conceivable ways to store binary data in it, and I have no idea how efficient any of them would be if you wanted something 99% reliable.
He said there's 176KB of entropy in that 1-second birdsong, which doesn't seem close to right. That's more data per second than a typical M4A uses, for a much simpler sound.
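For a rough sanity check, here is the arithmetic as a quick sketch. It assumes the 176KB figure is just the raw size of one second of CD-quality stereo PCM (44.1 kHz, 16-bit, 2 channels), which is my guess at where the number came from, and it assumes "typical M4A" means something in the 128 to 256 kbps range.

```python
# Assumption: 176KB is the raw size of 1 second of CD-quality stereo PCM.
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2  # 16-bit samples
CHANNELS = 2

raw_pcm_bytes_per_sec = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def bytes_per_sec(kbps: int) -> int:
    """Convert a codec bitrate in kilobits/s to bytes/s."""
    return kbps * 1000 // 8


print(f"raw PCM: {raw_pcm_bytes_per_sec} B/s")

# Assumption: 128 and 256 kbps as stand-ins for a "typical" M4A.
for kbps in (128, 256):
    compressed = bytes_per_sec(kbps)
    ratio = raw_pcm_bytes_per_sec / compressed
    print(f"{kbps} kbps M4A: {compressed} B/s, raw PCM is {ratio:.1f}x larger")
```

If that guess is right, the 176KB number is the size of the uncompressed container, not a measure of how much information the birdsong actually carries.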
Thinking about it in reverse: how much data would it take to encode 1 second of birdsong in the most efficient audio codec I can imagine? If M4A or MP3 with the bitrate slammed way down isn't a fair comparison, then take some birdsong-specific ML autoencoder... Probably 500 bytes? That would still be enough for a tweet.
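To put that guess in the same units, a small sketch. The 500 bytes is only my guess from above, not a measurement; the tweet size assumes 280 ASCII characters at one byte each, and "KB" is taken as 1000 bytes.

```python
GUESSED_BYTES_PER_SEC = 500   # guess from the comment, not measured
CLAIMED_BYTES_PER_SEC = 176_000  # the quoted 176KB, assuming KB = 1000 bytes
TWEET_BYTES = 280             # assumption: 280 ASCII chars, 1 byte each

# Equivalent codec bitrate for the guessed size.
guessed_kbps = GUESSED_BYTES_PER_SEC * 8 / 1000

# How far apart the claim and the guess are.
claim_to_guess_ratio = CLAIMED_BYTES_PER_SEC / GUESSED_BYTES_PER_SEC

fits_a_tweet = GUESSED_BYTES_PER_SEC >= TWEET_BYTES

print(f"guess is about {guessed_kbps} kbps")
print(f"claimed figure is {claim_to_guess_ratio:.0f}x the guess")
print(f"room for a tweet: {fits_a_tweet}")
```

So the guess works out to a few kbps, a couple of orders of magnitude under the quoted figure.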