You are spot on, and this has been the case with most/all LLMs since the beginning.
- Asking for structured output in the same request as the main unit of work introduces the chance of lower-quality output. Instead, do your unit of work first, then follow up with a separate request (or a different model) to parse the result into JSON or your flavor of structure (see the first sketch after this list).
- Grading on numerical scales introduces weird bias. I never went too far down that route, but I noticed that certain floating-point numbers showed up too often when using numerical scales. A word-based scale works a lot better (see the second sketch below).
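
A minimal sketch of the two-pass approach, assuming the OpenAI Python SDK (openai>=1.0); the model names, prompts, and the code-review task are illustrative stand-ins, not a prescription:

```python
from openai import OpenAI

client = OpenAI()

def review_code(diff: str) -> str:
    """Pass 1: do the actual unit of work in free-form prose."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a careful code reviewer."},
            {"role": "user", "content": f"Review this diff:\n\n{diff}"},
        ],
    )
    return response.choices[0].message.content

def structure_review(review_text: str) -> str:
    """Pass 2: a separate, cheaper request that only parses the prose into JSON."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "user",
                "content": (
                    "Convert this review into JSON with keys "
                    '"summary", "issues" (a list), and "verdict":\n\n' + review_text
                ),
            },
        ],
    )
    return response.choices[0].message.content
```

The point is that the first request never has to think about schemas; the second request never has to think about the code.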
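
And a sketch of a word-based grading scale, same SDK assumption as above; the particular labels and their numeric mapping are one arbitrary choice:

```python
from openai import OpenAI

client = OpenAI()

# Categorical labels the model is asked to choose from, mapped to numbers
# only after the fact, on our side.
SCALE = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}

def grade(answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    "Grade the following answer as exactly one of: "
                    f"{', '.join(SCALE)}. Reply with the single word only.\n\n{answer}"
                ),
            },
        ],
    )
    word = response.choices[0].message.content.strip().lower()
    return SCALE.get(word, 0)  # 0 = unparseable reply
```

The model only ever sees and emits words; the conversion to a number happens outside the prompt, so whatever attractor certain floating-point values have never comes into play.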