LLMs are not even good wordcels (ferrei.ro)
9 points by epidemian on June 8, 2024 | hide | past | favorite | 3 comments


i don't think weaknesses like these say much about the general capabilities or potential of LLMs. try this yourself: generate a novel pangram without any sort of iterative process or revisions -- the first thing that pops into your head you must commit to paper. it's very hard. it's a lot easier if you go in alphabetical order, since you don't need to keep track of which letters you've already used. interestingly, gpt-4o also performs better at this task when you ask it to go in alphabetical order.

LLMs have known weaknesses, many having to do with an inability to "think" without using tokens. so for tasks like these, you can dramatically increase their performance by getting them to think out loud. this was the crux of the "let's verify step by step" paper.

i ran this prompt three times:

### prompt start ###

your goal is to create a novel pangram in the spanish language that sounds natural. it should be grammatically correct and coherent. it shouldn't be just technically coherent and grammatically correct, it should sound like a normal sentence.

first, print out the spanish alphabet. these are your $remainingletters. $sentence = "". then enter a loop where you do the following:

loop 1:
- select a random letter from $remainingletters; choose based on which one allows you to add the most natural-sounding word to the sentence, do NOT go in alphabetical order
- eliminate the letter you chose from $remainingletters
- ensure the word you chose actually begins with that letter! very important
- add the word you chose for that letter to $sentence
- are there letters left in $remainingletters? if so, go back to the start of loop 1. otherwise move on to loop 2.

loop 2:
- go through the spanish alphabet in order and ensure your $sentence contains a word starting with each letter
- once every letter is accounted for, translate the sentence to english
- does it sound like a natural sentence?
- if not, go back to the start of loop 2
- if so, print $sentence as well as its translation in english

think out loud, keep track of your work as you go

i'm not asking you to generate code; i'm just explaining how you should accomplish the task

### prompt end ###

each time, it generated a valid pangram that was coherent and grammatically correct. i can't be bothered to run it enough times to get an accurate success rate, but i'm fairly sure there exists a prompt with a success rate of 100%. there are a lot of output tokens available, and by asking the model to iterate, it will arrive at something correct long before 128,000 tokens are exhausted.
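for anyone who wants to verify outputs like these, here's a quick sketch of a checker (assuming the 27-letter spanish alphabet, and treating accented vowels as their base letters while keeping ñ as a distinct letter):

```python
import unicodedata

SPANISH_ALPHABET = set("abcdefghijklmnopqrstuvwxyzñ")  # 27 letters

def is_spanish_pangram(sentence: str) -> bool:
    """True if the sentence contains every letter of the Spanish alphabet."""
    normalized = ""
    for ch in sentence.lower():
        if ch == "ñ":
            # ñ is its own letter in Spanish; don't decompose it into n.
            normalized += ch
        else:
            # Strip accents: á -> a, ü -> u, etc.
            decomposed = unicodedata.normalize("NFD", ch)
            normalized += "".join(c for c in decomposed
                                  if not unicodedata.combining(c))
    return SPANISH_ALPHABET <= set(normalized)
```

this only checks containment, not the stricter "one word per letter" structure the prompt builds toward, but it's enough to confirm the model's final sentence is actually a pangram.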

sidenote: i asked gpt-4o to generate 5 novel pangrams in english, and it got them all right. so language definitely matters when it comes to getting things like this right in one shot.


[author of the post here]

thanks for the very thorough reply! it's fascinating to see the techniques used to improve LLM outputs :)

i'll reply to some specific points, but i think your main argument of trying to find ways of working effectively with LLMs is spot-on.

> try this yourself: generate a novel pangram without any sort of iterative process or revisions -- the first thing that pops into your head you must commit to paper. it's very hard.

yes, it would indeed be very hard. and i don't know why i would try to do it that way. notice also that i didn't instruct ChatGPT to do it that way either.

to generate a pangram, i'd probably start with some random phrase, count the letters i've used and which ones i'm missing, and then iteratively tweak the phrase to use more letters of the alphabet until i've used them all. that at least seems like a reasonable strategy. and i would expect any intelligent agent to do the same. not that particular strategy, but "the same" as in: to try to find a strategy that works. after all, isn't that a fundamental part of intelligence? being able to find solutions to novel problems.
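that letter-counting loop is easy enough to sketch (a toy illustration of the strategy, not anything ChatGPT actually runs):

```python
import string

def missing_letters(sentence: str, alphabet: str = string.ascii_lowercase) -> list:
    """Return the letters of the alphabet not yet used in the sentence."""
    return sorted(set(alphabet) - set(sentence.lower()))

# Start with a random phrase, then repeatedly check what's missing
# and tweak the phrase until nothing is.
draft = "the quick brown fox jumps over a dog"
print(missing_letters(draft))  # -> ['l', 'y', 'z']

draft = "the quick brown fox jumps over a lazy dog"
print(missing_letters(draft))  # -> [] : it's a pangram now
```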

i know that LLMs don't work that way. and that's fine. but that was also the main point i tried to make in the post: we're being sold LLMs as "intelligent", yet they don't work in any way like what we would intuitively call intelligent.


I think that shows how LLMs lack an important part of what we as intelligent agents have: as the parent comment pointed out, the innate ability to maintain a train of thought or a self-check mechanism. As you said, a human wouldn't immediately blurt out or write down a phrase; given this problem, we would immediately start considering the constraints and possibilities. LLMs, by contrast, don't have this, and the ability has to be "bolted on" through a pre-prompt like "consider if your answer is correct" or "think step by step". As for actually choosing a viable strategy, such as figuring out to go alphabetically, this probably emerges as models get larger and larger; i.e., they need both the bolted-on train-of-thought ability and an actual good sense of reasoning and logic. They could be compared to a person (maybe a child) who can't come up with a good solution to, say, making a pangram.



