It sounds like these models think a lot; it seems the benchmarks were run with a thinking budget of 32k tokens - the full context length. (The paper's not published yet, so I'm just going by what's on the website.) Still, hugely impressive if the published benchmarks hold up under real-world use - the A3B in particular, outperforming QwQ, could be handy for CPU inference.
Edit: The larger models have 128k context length. The 32k thinking figure comes from the chart, which looks like it's for the 235B, so it's not the full length.