This is basically how Apache Storm's fault tolerance worked. The problem I have with it: what do you do when one of your workers fails and you're not sure whether it failed before or after sending its job completion message? I believe Storm just restarts everything, which is not great. But if you restart only that one job, you could be left with 'dangling' acks.
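To make the 'dangling' ack case concrete, here is a rough sketch of XOR-style ack tracking, loosely in the spirit of Storm's acker. Everything in it (the `AckTracker` class, the job IDs, the way a restart swaps in a fresh ID) is made up for illustration and is not Storm's actual API. It assumes the failed worker had already handed a child job downstream before it crashed, so the child's ack reaches the tracker even though the tracker was never told the child exists.

```python
class AckTracker:
    """Hypothetical tracker: one XOR checksum per job tree (not Storm's API)."""

    def __init__(self, root_id: int) -> None:
        self.checksum = root_id  # the root job starts out pending

    def report(self, finished_id: int, *child_ids: int) -> None:
        # One completion message: XOR out the finished job and
        # XOR in any children it spawned.
        self.checksum ^= finished_id
        for child in child_ids:
            self.checksum ^= child

    @property
    def complete(self) -> bool:
        return self.checksum == 0


# Fixed, made-up IDs so the runs are deterministic.
A, C = 0x1111, 0x2222    # first attempt: job A and its child C
A2, C2 = 0x3333, 0x4444  # restarted attempt: job A2 and its child C2


def clean_run() -> int:
    tracker = AckTracker(A)
    tracker.report(A, C)  # A finishes and announces child C
    tracker.report(C)     # C finishes
    return tracker.checksum


def partial_restart_run() -> int:
    tracker = AckTracker(A)
    # The worker hands C downstream, then crashes before reporting (A, C).
    tracker.report(C)       # C's ack arrives anyway
    tracker.report(A, A2)   # coordinator restarts only job A, as A2
    tracker.report(A2, C2)  # second attempt finishes and announces C2
    tracker.report(C2)      # C2 finishes
    return tracker.checksum


print(hex(clean_run()))            # 0x0, the tree is complete
print(hex(partial_restart_run()))  # 0x2222, C's ack is left dangling
```

In the second run the checksum never returns to zero, so the tree can only time out. That is presumably why replaying the whole tree under a fresh root ID is the simpler option, even if it is wasteful.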