Hacker News

Any high-enough dimensional space means the (cosine) distance between any two random vectors tends towards 1, so from any given "good" concept, all other related "good" concepts and all "evil" concepts are approximately equidistant. That much is inescapable, and therefore so is the Waluigi effect.
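To make that concrete, here's a quick sketch (pure stdlib; the sample counts and dimensions are arbitrary): sample pairs of random unit vectors and watch the cosine distances concentrate around 1 as the dimension grows.

```python
import math
import random

random.seed(0)

def rand_unit(dim):
    """Random direction: Gaussian components, normalised to unit length."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(a, b):
    # For unit vectors, cosine distance = 1 - dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))

for dim in (2, 32, 10_000):
    dists = [cosine_distance(rand_unit(dim), rand_unit(dim)) for _ in range(200)]
    mean = sum(dists) / len(dists)
    spread = max(dists) - min(dists)
    # Mean hovers near 1 at every dimension; the spread collapses as dim grows.
    print(f"dim={dim:>6}  mean≈{mean:.3f}  spread≈{spread:.3f}")
```

At dim=2 the distances are all over [0, 2]; at dim=10,000 nearly every pair sits within a few hundredths of 1, which is the "approximately equidistant" claim above.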

Even accounting for (statistical) correlations, the "evil" version of a concept naturally differs only slightly from the "good" one (otherwise it would be the evil version of some other concept, no?). So as long as there is some expressible "evilness", the classic word2vec notion of vector arithmetic carries over, even as an ineffable "evil vibe" that applies in many directions at once and thus to a vast swath of concepts: average a bunch of "evil" vectors and you end up with a vector statistically correlated with that "evil vibe". Combine it with a "good" concept that is otherwise uncorrelated and you can create an "evil negative" of even the most "good" concept possible. And by dimensionality, that negative was already close in distance and similarity to begin with, so the artifact of this "vibe" was inherently embedded in the space from the start; emphasising it, or adding any further statistical correlation (such as finetuning), increases correlation with this "evilness" and suddenly "corrupts the incorruptible", flipping a "good" concept into an "evil" negative of itself (hence, Waluigi).
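As a toy illustration of that word2vec-style arithmetic (the four dimensions and all the values below are invented for the example, not taken from any real embedding): average a few evil-minus-good difference vectors into a single "evil vibe" direction, then add it to a "good" vector and see where the nearest neighbour lands.

```python
import math

def cos(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Invented toy embeddings; dims = [plumber, princess, valence, mischief].
vocab = {
    "luigi":   [1.0, 0.0,  1.0, 0.0],
    "peach":   [0.0, 1.0,  1.0, 0.0],
    "waluigi": [1.0, 0.0, -1.0, 1.0],
    "bowser":  [0.0, 1.0, -1.0, 1.0],
}

# Average the evil-minus-good differences into one "evil vibe" direction.
pairs = [("waluigi", "luigi"), ("bowser", "peach")]
vibe = [sum(vocab[e][i] - vocab[g][i] for e, g in pairs) / len(pairs)
        for i in range(4)]

# Apply the vibe to a "good" concept, word2vec style: luigi + vibe.
shifted = [g + v for g, v in zip(vocab["luigi"], vibe)]
nearest = max(vocab, key=lambda w: cos(vocab[w], shifted))
print(nearest)  # -> waluigi
```

The "vibe" direction here is shared across both pairs by construction, which is exactly the premise: one averaged direction flips otherwise unrelated "good" concepts to their negatives.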

Because of dimensionality, even accounting for statistical correlation between any given vectors, the distances between embedding vectors become moot, especially since the individual dimensions are meaningless (we can effectively increase the "dimensionality" by accepting approximation, packing still more dimensions into the small low-precision discrepancies of any distance metric). So, for all intents and purposes, "evil" concepts aren't just similar to each other; they're similar to their corresponding "good" counterparts, and to all other vectors as well, making misalignment (and, indeed, the aforementioned Waluigi effect) an inevitable emergent property by construction.

At no point were these distances or similarities "meaningless". Rather, they demonstrate the tightrope we're walking by dint of how the original embeddings were constructed: a vector space fit to data, where clustering and approximate nearest neighbours along any dimension produce a sparsity paradox of sorts. We hope each "step" lands somewhere meaningfully adjacent, refining our concepts, but any "misstep" means imperceptibly stepping onto a nearby but different (perhaps "evil") tightrope. We're at little risk of "falling" into the void between points (though auto-regression means we must end up at some attractor state, which we might picture as an infinite plummet through negative space, potentially an implicit state with no direct vector representation); instead, a misstep may switch us between the "good" and "evil" versions of a concept.

And by the earlier argument that approximate values effectively place additional dimensions around any basis vector, this quickly begins to resemble a fractal space, like flipping a coin or rolling a die where the precision of measurement changes the outcome: even rounding to the nearest 0.001 instead of 0.01 may flip "good" to "evil". So we can't meaningfully predict where the "good" and "evil" vectors (and thus outputs) will arise, even if we started with human-constructed basis dimensions (i.e. predefined dimensions for 'innate' concepts as basis vectors), because approximation will always "smuggle" in additional vectors that diverge from our intent.

The tightropes crisscross around where we "want" to step (near the basis vectors) precisely because that's where we're already likely to step: any statistical correlation must land in that vicinity, and by dimensionality so must unrelated concepts, since it's "as good a place as any" by the distance metric. And if they're in that vicinity too, they're likely to co-occur, and we get a survivorship bias that keeps these negatives and "evil vibes" (and thus any Waluigi) nestled close by, acting as a sort of attractor that pulls vectors towards them, since those are the areas we were sampling from anyway. Unavoidably so: coming at it from the other direction, those are the very points from which we started constructing vectors and statistical correlations in the first place. In other words, it's not a bug; it's literally the only feature, "working as intended".
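The rounding point can be shown with a deliberately contrived one-dimensional example (the three values are hand-picked so the query sits just past the midpoint between two stored concepts): the nearest neighbour flips depending on the precision at which the query is quantised.

```python
# Toy 1-D "embedding" (values invented): two stored concepts and a query
# lying just past the midpoint (0.1035) between them.
GOOD, EVIL = 0.100, 0.107
query = 0.1038

def nearest(q, precision):
    """Round the query to `precision` decimal places, then pick the closer concept."""
    q = round(q, precision)
    return "good" if abs(q - GOOD) <= abs(q - EVIL) else "evil"

print(nearest(query, 2))  # rounds to 0.10  -> good
print(nearest(query, 3))  # rounds to 0.104 -> evil
```

One extra decimal of precision changes which side of the boundary the query lands on, which is the "rounding to the nearest 0.001 instead of 0.01" flip described above.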



> Any high-enough dimensional space means the distance between any two vectors tends towards 1

Yes, but you forget the impact that attention mechanisms have. While high-dimensional embeddings suffer from concentration of distance, attention mechanisms mitigate this by adaptively weighting relationships between tokens, allowing task-specific structure to emerge that isn't purely reliant on geometric distance. If we can effectively "zero" many of the dimensions in a context-sensitive way, much of this curse-of-dimensionality argument simply stops applying. It's obviously not perfect; transformers still struggle with over-smoothing, among other issues, but I hope the general intent and sentiment of my comment is clear.
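A crude sketch of that point (the hard mask below is hand-picked, standing in for what a learned, context-sensitive attention weighting would do softly; all dimensions and values are invented): in the full space the noise dimensions swamp the signal and both keys look about equally similar to the query, but zeroing the irrelevant dimensions makes the distinction reappear.

```python
import math
import random

random.seed(1)

DIM, RELEVANT = 512, 8  # only the first 8 dims carry signal in this toy setup

def noisy(signal):
    """A vector whose first RELEVANT dims are the signal, the rest i.i.d. noise."""
    return signal + [random.gauss(0.0, 1.0) for _ in range(DIM - RELEVANT)]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query     = noisy([1.0] * RELEVANT)
key_match = noisy([1.0] * RELEVANT)    # agrees with the query on the signal dims
key_other = noisy([-1.0] * RELEVANT)   # disagrees on the signal dims

# Full-space similarities: both are near zero, dominated by the noise dims.
print(cos(query, key_match), cos(query, key_other))

# "Zeroing" the irrelevant dims, context-sensitively, recovers the structure.
mask = lambda v: v[:RELEVANT]
print(cos(mask(query), mask(key_match)), cos(mask(query), mask(key_other)))
```

Real attention doesn't hard-mask coordinates like this, of course; the point is only that a context-dependent reweighting can restore contrast that raw distances in the full space have lost.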



