My Favorite Algorithm: Metropolis-Hastings

Matumio · on June 1, 2015

I like the build-up explanation in this chapter (Monte Carlo Methods) from MacKay's book: http://www.inference.phy.cam.ac.uk/mackay/itprnn/ps/356.384....

xioxox · on June 1, 2015

In practice, there are some nicer algorithms for doing Markov Chain Monte Carlo, such as Goodman & Weare's Affine Invariant sampler, as implemented in http://dan.iel.fm/emcee/current/ . This algorithm has the advantage of not needing a proposal distribution.

gfodor · on June 1, 2015

I always go back and forth on which of the two things are more amazing: the fact that we can accurately estimate the rendering equation in tractable time, or the fact that the universe manages to do it in real time.

eru · on June 2, 2015

The universe can take all the time it wants. You wouldn't notice.

mandelken · on June 1, 2015

Listen to Press on the naming of the algorithm, seems like Metropolis just put his name on the paper:

https://youtu.be/4gNpgSPal_8?t=938

mjfl · on June 1, 2015

This is true! It should be called the Rosenbluth-Hastings algorithm.

stellographer · on June 1, 2015

Interesting choice...

Seems like genetic algorithms, particle swarms etc. would be more attractive choices since they solve the same problem and are inherently parallelizable while Metropolis Hastings is almost 100 years old and designed for a 4 function calculator.

Although I guess some people have meta-parallelized it... but still seems like a patch job compared to modern likelihood navigation algos.

murbard2 · on June 1, 2015

No!

The point of Metropolis-Hastings is to sample from a distribution when you do not know the partition function. It is the most important building blocks in a set of algorithms broadly known as Markov Chain Monte Carlo. These algorithms are particularly useful when performing Bayesian statistics.

Genetic algorithms will not give you samples from a distribution, they only perform optimization. Particle swarms also focus on optimization, and on top of that, they do not seem to have either theoretical justification or empirical success.

MH is embarrassingly parallel since you run multiple chains at the same time. Again, the point isn't optimization (that would be simulated annealing) but sampling.

Being 100 years old is also largely irrelevant. People will publish new algorithms to get publications all the time. That do not mean they necessarily outperform the old ones. Gradient descent is the basic algorithm used in training all of these cool new deep learning algorithms, and it's much older than MH.

Yes, there are more recent improvements to MH, the two biggest ones being Hamiltonian Monte Carlo (which uses gradient information) and Parallel Tempering (which is somewhat similar to homotopy optimization), but that's hardly a reason to dismiss the importance of this algorithm.

stellographer · on June 2, 2015

I see you feel strongly about this.

"The point of Metropolis-Hastings is to sample from a distribution when you do not know the partition function." That's one point, yes. The other is optimization. In which case I prefer the others I mentioned.

"they do not seem to have either theoretical justification or empirical success." That's just patently false. They _do_ have empirical success. And besides, a lot of these variants have MH implemented inside of them to some extent o.o

"MH is embarrassingly parallel since you run multiple chains at the same time." I can make six single grilled cheese sandwiches in 2 minutes, but it takes me 8 minutes to make a 6-decker grilled cheese. Is that a parallel process?

"Being 100 years old is also largely irrelevant." In my opinion it is relevant since computation can now be done in completely new ways than 100 years ago.

murbard2 · on June 2, 2015

1) Can you point out where particle swarm optimization has been successfully applied? Papers specifically about particle swarm optimization carry very little weight as it is very easy to design toy problems where any optimization technique is going to perform well, I'm looking for actual, practical use.

2) When you are sampling a distribution, you're not trying to make a 6-decker grilled cheese, you're trying to make many grilled cheese sandwiches.

3) In completely new ways? Not really. Algorithmic complexity which dominates run time independent of the computing medium. An algorithm designed to be efficiently run by a group of human "computers" with calculators is probably very similar to the same algorithm designed to be run by a CPU. If anything, the CPU optimized algorithm are likely to benefit from more sequential processing and less parallelism.

stellographer · on June 3, 2015

just downvoting every one of my comments because you disagree with them? That's cool.

1) Just one example? Here: http://adslabs.org/adsabs/abs/2009PhDT.........4C/

2) You are trying to find the global maximum. How do you not understand the value of communication when searching for a maxima in the likelihood distribution? You're just being intentionally obtuse.

3) Yes, really.

A story:

You have a landscape with mountains and hills and you have one person trying to find the tallest mountain. That person is blind, they can't see shit. That person is also mute and deaf. Their only sense is a vibrating altimeter. They get drunk, and climb mountains for 100000 days, trying to find the tallest mountain.

You are advocating the idea that you should send 1000 of these blind deaf mutes out there one at a time (running these in parallel is just faster serial) and then they should vote on which mountain is the tallest at the end.

I'm saying you should send a bunch of not-deaf-mutes out there (ie implement mutation, breeding, cross-contamination, gravitation, whatever) so they can tell each other where the stupid mountains are (this requires _parallel_) and they don't waste their whole time stumbling around (burning in).

You're just being intentionally argumentative.

murbard2 · on June 3, 2015

HN does not allow downvoting replies to one's comment, so someone else must be downvoting you. Your tone is aggressive and you display a poor grasp of the topic which is probably why you're being downvoted.

1) This is an abstract of a PhD thesis which makes no mention of particle swarms, I couldn't find the full text. This is your evidence?

2) No this isn't about looking for the global maximum. Several people have explained to you that this is a sampling algorithm, but you still fail to show understanding of the difference.

3) Your story doesn't demonstrate your point and you do not understand the argument I presented to you. The types of algorithms suited for an army of people with calculators 100 years ago isn't fundamentally different from the type of algorithm suited for computers today, and if anything, it's likely to be more sequential, not less.

enupten · on June 1, 2015

One of the parallelized versions is Gibbs sampling that is used for sampling from Bayesian networks. In this case you don't even need a proposal distribution; neither would you need the test from MH.

I think Graphlab (https://dato.com/) comes implemented with something of this kind.

The trouble with Particle swarms/Genetic algorithms is that they aren't guaranteed to sample from the underlying p.d. It is not yet apparent whether you can find the mode of a distribution faster by choosing a Markov chain whose stationary distribution is different from the underlying one.

stellographer · on June 1, 2015

Are you sure that Gibbs sampling isn't just the multivariate version of MH?

What I'm saying is that convergence speed for MH is limited by the fact that guesses cannot communicate with each other... which doesn't matter when you have a pencil and a 4 function calculator like when it was designed.

A genetic algorithm or a particle swarm algorithm is capable of much swifter convergence because the guesses _can_ communicate and influence the direction of the drunken walk.

physicsyogi · on June 1, 2015

When I was in grad school, we used MH to compute 400-dimensional integrals. We computed ground state and excited state properties of a particular system, using a wavefunction as the probability distribution. This was easily parallelized.

e.g http://phya.snu.ac.kr/~gsjeon/Primary/Abstract/abs26.html

While I can't say that one guess communicated with others, I can say that whenever I moved one particle, the others knew about it immediately because the wavefunction described a strongly correlated system. Communication between guesses sounds really interesting though. I've been out of the game for a while, so I'll have to look that up.

stellographer · on June 1, 2015

Check out this guy's work for some really cool examples: http://en.wikipedia.org/wiki/Xin-She_Yang

tel · on June 1, 2015

If your goal is optimization then MH is a bad choice. This is fine since MH is simply not an optimization algorithm! It's design to make accurate samples from a posterior distribution which you can measure but not find a functional form for.

It's a vastly more challenging problem than mere optimization.

enupten · on June 1, 2015

It is a multivariate (co-ordinate wise) version of MH. You can paralleize it because the Bayes net allows a decomposition of the p.d.

I feel like while your comment on Global-optimization algorithms may indeed be true, I don't quite yet believe that the hacks they involve are quite that general yet.

MH wasn't designed for Global optimization, and there is only one "particle". I guess this what you meant by "parallel" ?

stellographer · on June 1, 2015

Yeah, that's what I meant, the one particle bit.

Frequently you'll see things along the lines of "we ran 10,000 MCMC iterations to find the solution" and my first thought is "that must be a lot of wasted cycles."

I think I see what you mean now about the parallelization by decomposing the variables (splitting the problem into i separate problems which can be chained independently?)-- I didn't know that was a possibility. I'll have to look at that.

tel · on June 1, 2015

Well, you can run multiple simultaneous particle chains and sum their results. There's some wasted work since each will need to burn in, but modern algorithms can make that go quite quickly.

enupten · on June 1, 2015

Sure - but that's not what the OP is referring to. There is no "sync" step in multiple parallel chains, unlike particle swarm & genetic algorithms.

MCMC is also used for finding MAP solutions, an operation which is strictly less difficult than computing the partition function.