Not out of the box, I think; I wouldn’t trust any self-assessment like that. With enough compute, you could probably come up with a metric by doing a beam search and using an LLM to evaluate how many of the resultant answers were effectively the same, as a proxy for “confidence”.
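Roughly something like this sketch, where `generate` and `judge_same` are hypothetical stand-ins for a sampled model call and an LLM-based "are these the same answer?" check:

```python
from typing import Callable

def consistency_score(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical: returns one sampled answer
    judge_same: Callable[[str, str], bool],  # hypothetical: LLM judge for equivalence
    n_samples: int = 10,
) -> float:
    """Sample n answers and return the fraction that agree with the modal answer."""
    answers = [generate(prompt) for _ in range(n_samples)]

    # Greedily cluster answers the judge considers equivalent.
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if judge_same(cluster[0], ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])

    # Confidence proxy: share of samples in the largest cluster.
    return max(len(c) for c in clusters) / n_samples
```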
Similar to bootstrapping an estimator in statistics. Your N estimates (each derived from resampling the data with replacement) give you an estimate of the estimator's sampling distribution. If the variance of that distribution is small (relative to the magnitude of the point estimate), then you have high confidence that your point estimate is close to the true value.
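For the statistics analogy, a minimal numpy sketch (toy data, estimating a mean) of how the spread of bootstrap estimates becomes a confidence measure:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # toy sample

# Bootstrap: re-estimate the mean on resamples drawn with replacement.
n_boot = 1000
estimates = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])

point_estimate = data.mean()
relative_spread = estimates.std() / abs(point_estimate)
print(f"point estimate: {point_estimate:.3f}, relative spread: {relative_spread:.3%}")
# Small relative spread -> high confidence the point estimate is near the true value.
```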
Likewise with your metric: if all the answers are the same despite perturbations, then the answer is more likely to be ... true?
I'd really like to see a plot of your metric versus the SimpleQA hallucination benchmark that OpenAI uses.