It is interesting to see the use of the A* path finding algorithm for finding optimal matchings of nodes.
The approach from Chawathe et. al splits nodes from the before/after trees into chains by their label in the syntax grammar, and then runs myers’ longest common subsequence on each pair of chains. Some parameters t, f are used to have an approximate ‘equals’ method for subtrees.
I’d be curious to see if this approach handles re-ordering of nodes better. The ‘fastMatch’ algorithm described above will typically miss matching cases where a node that is not order sensitive (i.e a function in a namespace can be moved somewhere else in that namespace).
Most tree diffing papers that I've seen focus on either (1) providing a minimal diff and accepting the performance cost or (2) providing a relatively minimal diff and focusing on the performance.
I've generally found that you need a minimal diff to get a good result, so papers in (2) are less applicable. I've also found several cases where there are several possible minimal diffs, but there's a clear 'correct' answer from the user's perspective.
Difftastic doesn't handle moves: the edit set is add, remove, or replace similar comment. If you reorder functions, it will take the largest unchanged subset. Moves are hard to model in a diffing algorithm, but they're also very hard to display coherently in the UI.
I know a few code forge websites (e.g. Phabricator) show moves in a fairly comprehensible way, although they're all based on line-based diffs.
The approach from Chawathe et. al splits nodes from the before/after trees into chains by their label in the syntax grammar, and then runs myers’ longest common subsequence on each pair of chains. Some parameters t, f are used to have an approximate ‘equals’ method for subtrees.
This iteratively builds a set of matchings between equivalent nodes from the old and new trees. Here’s the paper https://dl.acm.org/doi/10.1145/235968.233366
I’d be curious to see if this approach handles re-ordering of nodes better. The ‘fastMatch’ algorithm described above will typically miss matching cases where a node that is not order sensitive (i.e a function in a namespace can be moved somewhere else in that namespace).