Which can also be trivially done in any halfway-decent image editor. Algorithm-wise, it'd probably look a lot like QR code detection - which of course doesn't need any kind of fancy AI.
What I mean is you can probably get this to work with pure prompt engineering, and English language just by saying stuff like "Here's an image with correct hold positions", and then submit another image that says "Are the positions in this image correct." I just meant basic image understanding of AI, like what OpenAI has.
I realize there's an infinite number of ways to accomplish this that would be more complex. What I was stating is the simplest possible way being pure prompt engineering. You could even try with OpenAI right now, I didn't try.