Unfortunately it takes ~9 months just to build that network up to the point where you can start training it, and then the training itself is literally years of hard effort.
finally, we've achieved clarity of purpose and target as predicted by the classic internet meme Poe's Law. from now on our statements shall no longer be mistaken for nor contribute to that which we intend to criticize
Isn't this already done? I remember a "try to hack the llm" game posted here months ago, where you had to try to get the llm to tell you a password, one of the levels had a sanitzer llm in front of the other.
Isn't that just another guardrail that can be bypassed much the same as the guard rails are currently quite easily bypassed? It is not easy to detect a prompt. Note some of the recent prompt injection attack where the injection was a base64 encoded string hidden deep within an otherwise accurate logfile. The LLM, while seeing the Jira ticket with attached trace , as part of the analysis decided to decode the b64 and was led a stray by the resulting prompt. Of course a hypothetical LLM could try and detect such prompts but it seems they would have to be as intelligent as the target LLM anyway and thereby subject to prompt injections too.