Search arrays help, but parallel tool calling assumes you’ve solved two hard problems: generating diverse query variations, and verifying which result is correct. Most retrieval doesn’t have clean verification. The better approach is making search good enough that you sidestep verification as much as possible (hopefully you are only requiring the model to make a judgment call within its search array). In my case (OpenStreetMap data), lexical recall is unstable, but embeddings usually get it right if you narrow the search space enough—and a missed query is a stronger signal to the model that it’s done something wrong.
Besides, if you could reliably verify results, you’ve essentially built an RL harness—which is a lot harder to do than building an effective search system and probably worth more.
Besides, if you could reliably verify results, you’ve essentially built an RL harness—which is a lot harder to do than building an effective search system and probably worth more.