With a single modern CPU you can do a lot of operations at each pixel of a high-resolution screen and still hit a framerate higher than your monitor can display. And you typically have a few more cores left over for other work.
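As a rough back-of-envelope (assuming a 1920x1080 display, a 60 Hz refresh rate, and one ~3 GHz core): 1920 x 1080 x 60 ≈ 124 million pixel-updates per second, which leaves a budget of about 24 cycles per pixel per frame before you even reach for SIMD or those extra cores.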
Unless your language introduces unreasonable overhead, a for loop over the pixels is perfectly appropriate and fast.
The problem in this case is not looping over each pixel, but the overhead of a dynamic method call at each pixel. For example, if you're iterating over a []byte and setting each value to zero, the compiler can optimize that whole loop into a single memclr. Using an interface masks the underlying representation and consequently prevents any sort of inlining.
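To make that concrete, here's a minimal Go sketch (Setter, byteBuf, and the function names are made up for illustration):

    package pixels

    // Setter is a hypothetical per-pixel interface, standing in for an
    // image.Image-style API that hides the buffer behind a method call.
    type Setter interface {
        Set(i int, v byte)
    }

    type byteBuf []byte

    func (b byteBuf) Set(i int, v byte) { b[i] = v }

    // Direct loop: the Go compiler recognizes this pattern and replaces
    // the whole loop with one call to runtime.memclrNoHeapPointers.
    func zeroDirect(buf []byte) {
        for i := range buf {
            buf[i] = 0
        }
    }

    // Through the interface: every iteration is a dynamic dispatch, so
    // the compiler can't see the underlying []byte, can't inline Set,
    // and can't apply the memclr optimization.
    func zeroDynamic(s Setter, n int) {
        for i := 0; i < n; i++ {
            s.Set(i, 0)
        }
    }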
> Using an interface masks the underlying representation and consequently prevents any sort of inlining.
This sounds like a limitation of a particular optimizing compiler or interpreter rather than a problem with the language itself. For example, the plain Lua interpreter incurs quite a lot of overhead for this, while LuaJIT largely eliminates it; the standard Python interpreter definitely adds a lot of overhead as well.
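FWIW, the gap is easy to measure; a benchmark sketch (in a _test.go file, reusing the hypothetical zeroDirect/zeroDynamic definitions from the earlier example):

    package pixels

    import "testing"

    // One byte channel of a 1080p frame, as a rough stand-in for a framebuffer.
    var buf = make(byteBuf, 1920*1080)

    func BenchmarkZeroDirect(b *testing.B) {
        for i := 0; i < b.N; i++ {
            zeroDirect(buf) // byteBuf is assignable to []byte
        }
    }

    func BenchmarkZeroDynamic(b *testing.B) {
        for i := 0; i < b.N; i++ {
            zeroDynamic(buf, len(buf))
        }
    }

Run it with go test -bench=. and the direct version usually comes out far ahead; the exact ratio depends on the CPU and compiler version, which is rather the point: the cost lives in the implementation, not in the loop.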