At least in the original paper [1], the key idea is that instead of training a neural network to output "images" as arrays of pixels, you train a network to map a 3D location and a viewing direction to a volume density and a color.
For example, if you tell a NeRF that you have a camera at location (x, y, z) pointing in direction (g, h, j), you march along the ray emitted from the camera, query the network at sample points along it, and get back how much "stuff" is at each point (a density) and the RGB color it appears to have from that direction; blending those samples according to their densities gives the color the camera sees.
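To make that concrete, here is a heavily simplified PyTorch sketch of such a network. The name TinyNeRF, the layer sizes, and the two-hidden-layer MLP are placeholders invented for illustration; the model in the paper also positionally encodes its inputs and is much deeper.

    import torch
    import torch.nn as nn

    class TinyNeRF(nn.Module):
        def __init__(self, hidden: int = 128):
            super().__init__()
            # Input: 3D position (x, y, z) + 3D view direction -> 6 values.
            self.mlp = nn.Sequential(
                nn.Linear(6, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 4),  # 3 color channels + 1 density
            )

        def forward(self, xyz, direction):
            out = self.mlp(torch.cat([xyz, direction], dim=-1))
            rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
            sigma = torch.relu(out[..., 3])     # non-negative density
            return rgb, sigma

    # Query a single point as seen from a single direction:
    model = TinyNeRF()
    rgb, sigma = model(torch.tensor([[0.1, 0.2, 0.3]]),
                       torch.tensor([[0.0, 0.0, -1.0]]))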
Representing a scene this way enables rendering images at arbitrary resolutions (though rendering can be slow, since every pixel requires many network queries), and is naturally conducive to producing rotated views of objects or exploring 3D space. Also, at least in theory, it allows for more "compact" network architectures, since the network never has to output, say, a full 512x512x3 image.
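To actually get a pixel out of this, you integrate those per-point predictions along the camera ray. Below is a rough sketch of that volume-rendering step, assuming a query_fn with the same (points, directions) -> (colors, densities) behavior as the sketch above and made-up near/far bounds; it is not the paper's exact implementation (which also uses stratified and hierarchical sampling).

    import torch

    def render_ray(query_fn, origin, direction, near=0.0, far=4.0, n_samples=64):
        # Evenly spaced sample points between the near and far bounds.
        t = torch.linspace(near, far, n_samples)
        points = origin + t[:, None] * direction       # (n_samples, 3)
        dirs = direction.expand(n_samples, 3)

        rgb, sigma = query_fn(points, dirs)            # per-sample color and density

        # Distance between consecutive samples (last one padded with a large value).
        delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])
        alpha = 1.0 - torch.exp(-sigma * delta)        # opacity of each segment
        # Transmittance: how much light survives to reach each sample.
        trans = torch.cumprod(
            torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
        weights = alpha * trans
        return (weights[:, None] * rgb).sum(dim=0)     # composited pixel color

    # pixel = render_ray(TinyNeRF(), torch.zeros(3), torch.tensor([0.0, 0.0, -1.0]))

Each pixel is one such ray, so the output resolution is simply however many rays you choose to shoot.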
[1] Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", https://arxiv.org/pdf/2003.08934.pdf