Not trolling, but I'd bet something that's augmented with generative AI. Not to the level of describing scenes with words, but context-aware interpolation.
I don't want my video decoder inventing details which aren't there. I'd much rather have obvious compression artifacts than a codec whose "compression artifacts" look like perfectly realistic, high-quality hallucinated details.
A codec that uses AI isn't necessarily going to be using it to synthesize content. It could use it for things like improving rate-distortion optimizations and early-skip heuristics.
Modern video codecs are complex beasts. It's not as simple as "take a macroblock and find some motion vectors that bring the residual below a certain threshold, else code it as an intra-block". They have hundreds of mutually exclusive techniques for compressing a given area of the frame. Determining which technique will yield the smallest residual is done with fast early-skip heuristics that often make the wrong decision. The official manual for x265, the H.265/HEVC encoder (https://x265.readthedocs.io/en/master/cli.html), has literally hundreds of options, almost all of them for tuning this myriad of heuristics for your particular input.
AI can be used to enhance things like early-skip heuristics. "This block looks like it'll benefit from a really-detailed motion search in this particular area" or "we'll save bits if we bypass the DCT step and quantize the block directly" or "this frame should definitely be a B-frame". Encoders already use heuristics to do this (brute forcing all possible decisions to find which is optimal is too slow), but they don't always make the best decision. An AI could be used to improve that.
Now, when I say AI, I'm not talking about massive, multi-billion-weight monstrosities that synthesize nonsense, but extremely simple neural networks with a few thousand weights. The popular Opus codec uses a simple NN to estimate whether a frame of audio is speech or music, and uses that determination to decide whether to encode that particular frame with its speech-optimized algorithm (SILK) or its music-optimized algorithm (CELT). It's a short read but a very good one: https://jmvalin.ca/opus/opus-1.3/
The same approach could be extended to video encoders without using AI for interpolation, where it would be liable to synthesize things that aren't there, DLSS-style.
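To make the idea concrete, here's a minimal sketch of such an encoder-side heuristic, assuming a tiny feed-forward network (the features, names, and weights are all hypothetical; this is not x265's or Opus's actual code):

    import numpy as np

    rng = np.random.default_rng(0)

    # ~200 weights here (a production heuristic might use a few thousand):
    # 8 cheap block statistics -> 16 hidden units -> 3 possible decisions.
    # In reality these would be trained offline, not random.
    W1, b1 = rng.normal(scale=0.1, size=(16, 8)), np.zeros(16)
    W2, b2 = rng.normal(scale=0.1, size=(3, 16)), np.zeros(3)

    def decide(block_features):
        # block_features: cheap statistics gathered before any expensive search
        # (variance, SAD against the predicted motion vector, neighbour modes, ...)
        h = np.maximum(W1 @ block_features + b1, 0.0)   # ReLU
        scores = W2 @ h + b2
        return ["early_skip", "cheap_search", "full_rdo_search"][int(np.argmax(scores))]

    # Called once per block instead of a hand-tuned threshold cascade:
    print(decide(np.array([4.2, 0.8, 1.1, 0.0, 2.3, 0.5, 1.7, 0.9])))

The network only steers which of the encoder's existing tools get tried; it never touches the pixels, so it can't hallucinate anything.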
In the case of many textures (grass, sand, hair, skin, etc.), it makes little difference whether the high-frequency details are reproduced exactly or hallucinated. E.g. it doesn't matter whether the 1262nd blade of grass from the left is bending to the left or to the right.
And in the case of many others, it makes a very significant difference. And a codec doesn't have enough information to know.
Imagine a criminal investigation. A witness happened to take a video as the perpetrator committed the crime. In the video, you can clearly see a recognizable detail on the perpetrator's body in high quality; a birthmark, perhaps. This rules out the main suspect -- but can we trust that the birthmark actually exists and isn't hallucinated? Would a non-AI codec have just shown a clearly compression-artifact-looking blob of pixels which can't be determined one way or the other? Or would a non-AI codec have contained actual image data of the birthmark in sufficient detail?
Using AI to introduce realistic-looking details where there were none before (which is what your proposed AI codec inherently does) should never happen automatically.
> And in the case of many others, it makes a very significant difference.
This is very true, but we're talking about an entertainment provider's choice of codec for streaming to millions of subscribers.
A security recording device's choice of codec ought to be very different, perhaps even regulated to exclude codecs which could "hallucinate" high-definition detail not present in the raw camera data, and the limitations of the recording media need to be understood by law enforcement. We've had similar problems since the introduction of tape recorders, VHS and so on; they always need to be worked out. Even the Phantom of Heilbronn (https://en.wikipedia.org/wiki/Phantom_of_Heilbronn) turned out to be DNA contamination of swabs by someone who worked for the swab manufacturer.
I don't understand why it needs to be a part of the codec. Can't Netflix use relatively low bitrate/resolution AV1 and then use AI to upscale or add back detail in the player? Why is this something we want to do in the codec and therefore set in stone with standard bodies and hardware implementations?
We're currently indulging a hypothetical: the idea of AI being used either to improve the quality of streamed video or to provide the same quality at a lower bitrate, so the focus is on what both ends of the codec could agree on.
The coding side of "codec" needs to know what the decoding side would add back in (the hypothetical AI upscaling), so it knows where it can skimp and still get a good "AI" result, versus where it has to be generous in allocating bits because the "AI" hallucinates too badly to meet the quality requirements. You'd also want it specified, so that any encoding displays the same on any decoder, and you'd want it in hardware, as most devices that display video rely on dedicated decoders to play it at full frame rate and/or not drain their battery. If it's not in hardware, it's not going to be adopted. It is possible to have different encodings, so a "baseline" encoding could leave out the AI upscaler, at the cost of needing a higher bitrate to maintain quality, or switching to a lower quality if the bitrate isn't there.
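As a rough, runnable illustration of that coupling (everything here is a stand-in: the "codec" is just quantization, the "AI upscaler" is a trivial placeholder for the specified decoder-side model, and PSNR stands in for the real quality metric):

    import numpy as np

    def encode_decode(block, step):
        # Stand-in for a real encode/decode round trip: a larger quantization
        # step means fewer bits and more loss.
        return np.round(block / step) * step

    def ai_upscale(block):
        # Stand-in for the *specified* decoder-side upscaler; a real one would
        # add back plausible high-frequency detail.
        return block

    def psnr(a, b):
        mse = np.mean((a - b) ** 2) + 1e-12
        return 10 * np.log10(255.0 ** 2 / mse)

    def choose_quantizer(block, steps, target_db=38.0):
        # Try the cheapest encodings first; accept the first one whose
        # AI-upscaled reconstruction still meets the quality target.
        for step in sorted(steps, reverse=True):          # coarsest = cheapest first
            recon = ai_upscale(encode_decode(block, step))
            if psnr(recon, block) >= target_db:
                return step                               # skimp: the AI fills it in well enough
        return min(steps)                                 # AI result too poor: spend the bits

    block = np.random.default_rng(1).integers(0, 256, size=(16, 16)).astype(float)
    print(choose_quantizer(block, steps=[1, 2, 4, 8, 16]))

The key point is that the encoder can only make this trade-off because the upscaler is specified and identical on both ends.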
Separating out the codec from the upscaler, and having a deliberately low-resolution / low-bitrate stream be naively "AI upscaled", would, IMHO, look like shit. It's already a trend in computer games to render at a lower resolution and have dedicated graphics card hardware "AI upscale" it (DLSS, FSR, XeSS, PSSR), because 4K resolutions are just too much work to render modern graphics at a consistent 60fps. But the result, IMHO, glitches noticeably and distractingly all the time.
> a codec doesn't have enough information to know.
The underlying belief is that modern trained neural network methods, which improve on ten generations of variations of the discrete cosine transform and wavelets, can bring a codec from "1% of knowing" to "5% of knowing". This is broadly useful. The level of abstraction does not need to be "the AI told the decoder to put a finger here"; it may be "the AI told the decoder how to terminate the wrinkle on a finger here". An AI detail overlay. As we go from 1080p to 4K to 8K and beyond, we care less and less about individual small-scale details being 100% correct, and there are representative elements that existing techniques are just really bad at squeezing into higher compression ratios.
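A hedged sketch of what such a detail overlay could look like in the decode path (the 3x3 "detail predictor" here is a made-up placeholder with untrained weights; a real one would be specified by the codec and trained offline):

    import numpy as np

    rng = np.random.default_rng(2)
    W = rng.normal(scale=0.01, size=(9,))   # tiny 3x3 detail predictor (placeholder)

    def predict_detail(coarse):
        # Predict a high-frequency overlay purely from the coarse reconstruction.
        padded = np.pad(coarse, 1, mode="edge")
        out = np.zeros(coarse.shape)
        for y in range(coarse.shape[0]):
            for x in range(coarse.shape[1]):
                out[y, x] = W @ padded[y:y + 3, x:x + 3].ravel()
        return out

    def decode_frame(coarse_base):
        # Conventional coarse reconstruction + synthesized fine detail on top.
        return np.clip(coarse_base + predict_detail(coarse_base), 0, 255)

The overlay never decides what is in the frame; it only fills in how the already-transmitted structure terminates at the finest scale.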
I don't claim that it's ideal, and the initial results left a lot to be desired in gaming (where latency and prediction are a Hard Problem), but AI upscaling is already routinely used for scene rips of older videos (from the VHS or DVD age), and it's clearly going to happen inside a codec sooner or later.
The entire job of a codec is lossy compression that stays subjectively authentic. AI is our best and in some ways easiest method of lossy compression. All lossy compression produces artifacts; JPEG's block artifacts are effectively a hallucination, albeit one that is immediately identifiable because it fails to simulate anything else we're familiar with.
AI compression doesn't have to operate at the level of compression found in image generation prompts, though. A Sora prompt might be 500 bits (~1 bit per character of natural English), while a decompressed 4K frame that you're trying to bring to a 16K level of simulated detail starts out at 199 million bits. It can be a much finer level of compression.
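For scale, the back-of-the-envelope numbers behind that comparison (assuming 24-bit RGB for the uncompressed frame):

    prompt_bits = 500                 # ~500 characters at ~1 bit per character of English
    frame_bits = 3840 * 2160 * 24     # one uncompressed 4K frame, 24-bit RGB
    print(frame_bits)                 # 199_065_600, i.e. ~199 million bits
    print(frame_bits // prompt_bits)  # ~400,000x more raw data than a whole prompt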
Yeah, I had that case in mind actually. It's a perfect illustration of why compression artifacts should be obvious and not just realistic-looking hallucinations.
Maybe there could be a "hallucination rate" parameter in the encoder: more hallucination would allow higher subjective image quality without increased accuracy. It could be used for Netflix streaming, where birthmarks and other forensic details don't matter because it's all just entertainment. Of course, the hallucination parameter would need to be embedded in the output somehow, so that the reliability of the image can be determined later.
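Purely as a hypothetical sketch of that knob (no real codec exposes such a parameter today), the important part is that the chosen rate gets written into the stream itself rather than staying an invisible encoder-side hint:

    from dataclasses import dataclass

    @dataclass
    class EncoderSettings:
        bitrate_kbps: int
        # 0.0 = no synthesis (forensic use), 1.0 = maximum synthesis (entertainment)
        hallucination_rate: float

    def encode(frames, settings):
        # The rate is signalled in the output so a viewer (or a court) can later
        # judge how trustworthy the pixels are.
        metadata = {"hallucination_rate": settings.hallucination_rate}
        bitstream = b""   # ...actual encoding would go here...
        return metadata, bitstream

    streaming = EncoderSettings(bitrate_kbps=4000, hallucination_rate=0.8)
    dashcam = EncoderSettings(bitrate_kbps=4000, hallucination_rate=0.0)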
Neural codecs are indeed the future of audio and video compression. A lot of people / organizations are working on them and they are close to being practical. E.g. https://arxiv.org/abs/2502.20762
We already have some of the stepping stones for this. But honestly it's much better suited to upscaling poor-quality streams; applied to an already-good stream it just gives things a weird feeling.