First model I've seen that was consistently compositional, easily handling requests like
“Generate an image of an African elephant painted in the New England flag, doing a backflip in front of the Russian Federal Assembly.”
OpenAI made the biggest step change towards compositionality in image generation when they started directly generating image tokens for decoders from foundation LLMs, and it worked very well (OpenAI's images were better in this regard than Nano Banana 1, but struggled with some OOD prompts like elephants doing backflips). Nano Banana 2, though, nails this stuff in a way I haven't seen anywhere else.
If video follows the same trend as images in terms of prompt adherence, that will be very valuable... and interesting.
Shameless plug of a personal blog post, but relevant. It's still not fully edited, so the writing is a bit scattered, but the crux is that we now have a framework for talking about consciousness intelligently. It's not as mysterious as it was in the past, given advances in non-equilibrium thermodynamics, and the Free Energy Principle in particular.
TL;DR: Internal LLM representations correspond to an understanding of the visual world. We've all seen the Othello example, which is too constrained a world to mean much, but even more interesting is that LLMs can caption tokenized images with no pretraining on visual tasks whatsoever. Specifically: pass an image to an encoder-decoder vision model trained on images in a completely unsupervised manner -> take the encoded representation -> pass that representation to an LLM as tokens -> get accurate captions. The tests were done on GPT-J, which is not multimodal and has only about 7bn params. The only caveat is that a linear mapping model must be trained to map the encoder-decoder model's vector space to the language model's embedding space, but this isn't doing any conceptual labour; it's only needed to align the completely arbitrary coordinate axes of the vision and language models, which were trained separately (akin to an American and a European agreeing to use metric or imperial: neither's conception of the world changes).
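A minimal sketch of that pipeline, with random arrays standing in for the frozen vision encoder and LM (the dimensions and names here are made up for illustration; in the real setup, only the linear map is trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vision latent dim, LM embedding dim, patches per image.
D_VISION, D_LM, N_PATCH = 512, 4096, 16

# Frozen encoder output for one image (random stand-in for a real encoder).
vision_latents = rng.standard_normal((N_PATCH, D_VISION))

# The only trained component: a linear map W aligning the two vector spaces.
W = rng.standard_normal((D_VISION, D_LM)) * 0.02

# Project the vision latents into the LM's embedding space ("soft tokens")...
soft_tokens = vision_latents @ W          # shape (N_PATCH, D_LM)

# ...then prepend them to the embedded text prompt before the LM forward pass,
# and let the LM decode a caption as ordinary next-token prediction.
prompt_embeds = rng.standard_normal((4, D_LM))   # e.g. embeds of "A photo of"
lm_input = np.concatenate([soft_tokens, prompt_embeds], axis=0)
print(lm_input.shape)                     # (20, 4096)
```

The point of the sketch is how little machinery sits between the two frozen models: a single matrix multiply, no fine-tuning of either side.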
It's not intuitive, but it's hard to argue with these results. Even small LLMs can caption images. Sure, they don't get the low-level details like the texture of grass, but they get the gist.
I keep reading your sort of analysis, but honestly, those priors need updating. I had to update when I learned this. If 7bn params can do it, 175bn params with multimodality certainly can.
It's true that humans need symbol grounding, but we don't see hundreds of billions of sequences. There are theoretical reasons (cf. category theory) why this could work, albeit probably limited to gist rather than detail.
I disagree with the AI doomers on technical grounds, but there's no need to overreact to the overreaction to LLMs.
Apocalyptic predictions are visible in the myths of every civilisation in the written record. They're probably a Jungian archetype that protects populations from the potentially entropic consequences of exploring the unknown too quickly. Religious? Sure. Religions have served us well on evolutionary timescales, so I'm not sure why Yann is surprised they've been transposed into forms modern people (including atheists) can suspend disbelief in. I'm not worried about the existential risks of AI myself just yet, but these overreactions are society's compute trade-offs between exploration and exploitation in the face of a very new technology. This has been happening since not long after we left the trees. It's what cultural species do.
Manufacturers mostly make black or white cars because that's what people want. Modern culture has made people so boringly conventional that there's no point in manufacturers offering different colours, so they charge a premium or don't offer them at all. Your explanation also fails to explain why exactly the same trend is happening in fashion. Is there some conspiracy where clothing manufacturers are trying to restrict choice in clothing, too? At some point, you have to question the culture at large rather than individual industries.
The irony of this post. Brains are sparser than transformers, not denser. That sparsity lets you learn symbolic concepts instead of generalising from billions of spurious correlations. Sure, that works when you've memorised the internet, but it falls over quickly out of domain. Humans, by contrast, don't fall over when the domain shifts, despite far less training data. We generalise using symbolic concepts precisely because our architecture and training procedure look nothing like a transformer's. If your brain were a scaled-up transformer, you'd be dead. Don't take this the wrong way, but it's you who needs to read some neurology instead of pretending to an understanding you haven't earned. "Just an emergent property of billions of connected weights" is such an outdated view. Embodied cognition, extended minds, collective intelligence: a few places to start.
I'm saying that despite the brain's different structure, mechanism, physics and so on... we can clearly build other mechanisms with enough parallels to say, with some confidence, that _we_ can get intelligence of a different but comparable type to emerge from small components on a scale of billions.
At whatever scale you look, everything boils down to interconnected discrete simple units, even the brain, with complexity emerging from the interconnections.
Turing completeness is an incredibly low bar, and it doesn't undermine this criticism. Conway's Game of Life is Turing complete, but try writing modern software in it. That transformers can express arbitrary programs in principle doesn't mean SGD can find them. Following gradients only works when the data being modelled lies on a continuous manifold; otherwise it gives a statistical approximation at best. All sorts of data we care about live in topological spaces with no metric: algorithms in computer science, symbolic reasoning in maths, etc. If SGD worked for these cases, LLMs would be pushing research boundaries in maths and physics, or at the very least having a good go at Chollet's ARC challenge, which is trivial for humans. Unfortunately, they can't, because SGD makes the wrong assumptions about how to search for programs in discrete/symbolic/topological spaces.
For how long? Urbanisation is recent on evolutionary timescales, and it has created selection pressures on previously irrelevant variables (the desire to have children, in particular). Without contraception or female labour-force participation, people before urbanisation didn't decide how many children to have: everyone who didn't die just kept having kids. The poorest places are still like this. But in evolutionarily novel urban contexts, people choose how many kids to have, and if that decision has a genetic component, the current urban milieu selects for people who want more kids. Basic genetics tells us that the number of people who "want more kids" will then grow exponentially (so long as culture doesn't keep driving preferences for children down). The key question, of course, is "what's the exponent?". It depends on the heritability of fertility, and data from behavioural genetics indicate that it's high in urbanised societies. If your parents had more kids than average, you probably will too (even if you were adopted at birth and didn't grow up with them). The effect is large enough that it would have a big impact on the UN population estimates by 2100. See https://www.sciencedirect.com/science/article/abs/pii/S10905...
The cultural offset is an important caveat, but it's possible that most of the cultural fertility decline due to urbanisation has already been exhausted in urban societies, in which case the heritability effect will become more important.
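For intuition about how fast this compounds, here's a toy simulation with entirely made-up numbers. It assumes a "wants more kids" type that is transmitted perfectly from parent to child; real heritability is partial, which slows the same dynamic but doesn't stop it:

```python
# Two heritable fertility "types" (illustrative numbers, not empirical estimates).
high, low = 1_000, 99_000        # initial headcounts: 1% of the population
F_HIGH, F_LOW = 2.4, 1.6         # children per couple for each type

GENS = 5
for _ in range(GENS):
    # Per-generation growth factor is children-per-couple divided by two.
    high *= F_HIGH / 2
    low *= F_LOW / 2

share = high / (high + low)
print(round(share, 3))           # 0.071: from 1% to ~7% of the population
```

Even a modest fertility gap turns a 1% minority into ~7% within five generations; that's the exponential the comment above is pointing at, before any cultural counter-pressure is applied.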
I find it strange that natural selection is never part of debates like this. It's laughable that the UN ignores it completely. I suspect the human tendency to view ourselves as separate from the animals makes selection seem irrelevant. But we shouldn't forget its power: the types of people who decide not to reproduce today are the types of people who won't be there to make that decision tomorrow.
All this to say, our contemporary perspective might seem very parochial in a few generations, and while we can't predict future culture, the power of selection should never be ignored.
> More people means less energy and resources per person though
Malthusianism didn't work in the 1700s and doesn't work now. In the 80s, economists produced endogenous growth theories explaining how a larger population can lead to higher output per capita. Romer won a Nobel prize for it.
The evidence we have now strongly supports their arguments. Productivity scales superlinearly with urban population: each doubling of city size brings roughly a 15% increase in per-capita output. This holds everywhere we look, i.e. regressing log(total output) on log(city population) produces near-perfect lines with slopes close to 1.15, whether you're looking at the U.S. or Bangladesh. This is easily explicable in terms of network externalities. Furthermore, studies comparing the networked lives of people in more and less populous areas (inferred from phone-usage data) show that people in bigger cities do indeed communicate more. How much more? About 15% for every doubling of city size. All of this can be explained with graph theory: more nodes -> more cultural and technical niches (interconnected node clusters) -> more potential synergies, as every niche has more niches to connect to. Connectivity increases faster than linearly, hence the superlinear scaling. Not coincidentally, brain connectivity seems to follow the same pattern: double the number of cortical neurons in a mammal and the number of synapses per neuron increases by 15-20%.
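A quick sketch of what that log-log regression looks like on synthetic data generated with a 1.15 exponent (illustrative only, not real city data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assume total city output Y ~ N**1.15 plus noise, in log-log space.
BETA = 1.15
log_pop = rng.uniform(10, 17, size=200)                 # log city populations
log_output = BETA * log_pop + rng.normal(0, 0.1, 200)   # log total output

# Recover the scaling exponent by least-squares fit of a line in log-log space.
slope, intercept = np.polyfit(log_pop, log_output, 1)
print(round(slope, 2))                                  # ~1.15

# Implied per-capita gain from one doubling of population: 2**(beta - 1) - 1
# (~11% with this exponent; the literature often rounds the effect to ~15%).
print(round(2 ** (slope - 1) - 1, 2))
```

The exponent is what matters: anything above 1 for total output means per-capita output rises with city size, which is the anti-Malthusian point.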
All this to say, we certainly aren't approaching a world of less abundance per capita as the world continues to urbanise. Anybody who thinks we are just isn't acquainted with the evidence.