Aren’t they doing alignment? One way to do it is simply to omit problematic material from the training set. Another is to “penalize” the model when it does say something problematic, essentially teaching it that the output is undesirable.
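As a rough intuition for the “penalize” part, here is a minimal toy sketch in PyTorch: it pushes down the log-likelihood the model assigns to a completion that was flagged as undesirable. All names here are hypothetical, and real systems do something far more involved (a learned reward model plus RL fine-tuning, i.e. RLHF), but the core idea of a training signal against bad outputs is the same.

```python
import torch
import torch.nn.functional as F

def penalty_loss(logits, bad_token_ids, penalty_weight=1.0):
    """Toy negative-feedback loss (illustrative, not any lab's actual method).

    logits: (seq_len, vocab_size) scores for a completion flagged as bad.
    bad_token_ids: (seq_len,) the token ids the model actually produced.
    Minimizing this loss lowers the probability of reproducing that output.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-likelihood of each flagged token at its position.
    token_ll = log_probs[torch.arange(len(bad_token_ids)), bad_token_ids]
    # Positive loss proportional to how likely the bad output was.
    return penalty_weight * token_ll.mean()
```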
Presumably they are also constructing the prompt to steer away from those things, and adding external filters on top of that. But I doubt that’s all they’re doing.
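The “external filters” layer is the simplest of these to picture: a check that sits outside the model entirely, on the prompt going in and the text coming out. The sketch below uses a keyword blocklist as a stand-in for what in production would be a learned moderation classifier; everything here is hypothetical illustration.

```python
# Toy stand-in for a real moderation classifier.
BLOCKLIST = {"build a bomb", "gray goo"}

def filtered_generate(model_generate, user_prompt):
    """Wrap any generate function with input- and output-side filters."""
    refusal = "Sorry, I can't help with that."
    # Input-side guard: refuse before the model even runs.
    if any(phrase in user_prompt.lower() for phrase in BLOCKLIST):
        return refusal
    reply = model_generate(user_prompt)
    # Output-side guard: check the completion as well.
    if any(phrase in reply.lower() for phrase in BLOCKLIST):
        return refusal
    return reply
```

The point of layering it this way is that the filter works regardless of how the model was trained, which is also why it’s shallow: it only catches what the classifier recognizes.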
They are doing alignment, but through a shallow and fragile strategy. If you build a very smart, very capable AI that will still turn the earth into gray goo when asked in a roundabout way, you have failed at alignment.