There are at least 3 fundamentally different kinds of diff:
* Single-dimensional. Diffs of text lines are just this.
* Multi-dimensional. Diffs of words or characters are usually going to be this since lines still matter, but there are multiple approaches (line-first? weighted tokens?).
* Tree-based. Unfortunately, these are woefully scarce and poorly documented.
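To make the "line-first" option concrete, here is a minimal Python sketch using the standard library's `difflib`. It diffs lines first, then diffs words only inside replaced line pairs. The function names `word_diff` and `line_then_word_diff` are made up for illustration, and pairing replaced lines one-to-one is a simplifying assumption, not the only reasonable choice.

```python
import difflib

def word_diff(old_line, new_line):
    """Diff two lines word by word; return only the non-equal opcodes."""
    aw, bw = old_line.split(), new_line.split()
    sm = difflib.SequenceMatcher(a=aw, b=bw, autojunk=False)
    return [(tag, aw[i1:i2], bw[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

def line_then_word_diff(old, new):
    """Line-first diff: refine to words only where lines were replaced 1:1."""
    a, b = old.splitlines(), new.splitlines()
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Assumption: the k-th old line corresponds to the k-th new line.
            for old_line, new_line in zip(a[i1:i2], b[j1:j2]):
                result.append(("words", word_diff(old_line, new_line)))
        else:
            result.append((tag, a[i1:i2], b[j1:j2]))
    return result
```

A weighted-token approach would instead tokenize the whole text (newlines included) and diff once, which gives different results when edits cross line boundaries.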
For text diffs, it's nontrivial to get the "missing newline at end of file" logic working.
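As one illustration of why this is fiddly: Python's `difflib.unified_diff` does not emit the `\ No newline at end of file` marker itself, so a wrapper has to add it. The sketch below is a rough workaround, assuming inputs are plain strings; `unified_diff_text` is a hypothetical helper name.

```python
import difflib

MARKER = "\\ No newline at end of file\n"

def unified_diff_text(old, new):
    """Unified diff that marks a final line lacking a trailing newline."""
    # keepends=True preserves the difference between "y" and "y\n".
    a = old.splitlines(keepends=True)
    b = new.splitlines(keepends=True)
    out = []
    for line in difflib.unified_diff(a, b, "a", "b"):
        if line.endswith("\n"):
            out.append(line)
        else:
            # Only a final line without "\n" can reach here.
            out.append(line + "\n" + MARKER)
    return "".join(out)
```

Without `keepends=True`, `"x\ny"` and `"x\ny\n"` would split into identical line lists and the change would be invisible.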
For tree diffs, consider that for HTML something like `<p>x</p><p>y</p>` should be unmergeable, whereas `<b>x</b><b>y</b>` should be mergeable.
(Aside: the blind promotion of `<em>` and `<strong>` did great harm to the notion of semantic HTML. Most things people use italics for (book titles, thoughts, foreign words) are explicitly things that `<em>` should not be used for.)
Another thing I’ve encountered with tree/structured diffs is the concept of identity. `diff([{id:1,name:foo}],[{id:2,name:foo}])` should show the object with id:1 removed and an object with id:2 added, not the id changed from 1 to 2. That's tough because your diffing algorithm then needs to be aware of the object structure (IMO, relying on convention and saying “no objects can contain this key” is pretty tough when you accept any user-generated data).
Though I would say that a diff has to define the set of operations allowed on the thing being diffed.
E.g., in the example of diffing JSON objects, if changing a property value (such as the "id" field) is a permitted operation, then the diff correctly deduces that the smallest possible change is a change to that field.
However, if you define the set of operations to include only replacing an entire object (with no changing of the id field), then surely you can create a diff that produces the desired object-level change. It would be a custom diff algorithm, of course, but it'd be quite a useful one tbh.
I think his point was that different fields should be treated differently. I.e. if you have two objects with the same ID but different descriptions then you can assume that it's the same object but with a changed description; but if you have two objects with different IDs but the same description then you should assume that the new object is completely different and the identical description is coincidental.
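A minimal Python sketch of that interpretation, assuming every object is a dict carrying a unique key (here `"id"`). `diff_by_id` is a hypothetical name, and matching on the key is a policy choice baked into the code, not something the data can prove.

```python
def diff_by_id(old, new, key="id"):
    """Match objects by identity key; other fields are mere attributes."""
    old_by_id = {obj[key]: obj for obj in old}
    new_by_id = {obj[key]: obj for obj in new}
    return {
        # Present only in the old list: treated as removed.
        "removed": [o for k, o in old_by_id.items() if k not in new_by_id],
        # Present only in the new list: treated as added.
        "added": [n for k, n in new_by_id.items() if k not in old_by_id],
        # Same key on both sides but different contents: treated as changed.
        "changed": [(o, new_by_id[k]) for k, o in old_by_id.items()
                    if k in new_by_id and o != new_by_id[k]],
    }
```

With this, `diff_by_id([{"id": 1, "name": "foo"}], [{"id": 2, "name": "foo"}])` reports one removal and one addition, while two objects sharing `id` 1 with different descriptions are reported as a change.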
I don't agree that these are always the correct interpretations though. IDs could be reused (especially in a DVCS) or mistaken IDs could be corrected. This ambiguity is a fundamental limitation of the entire concept of diffing, that is, of reconstructing a set of operations to go from one state to another: you simply don't have the information to deduce the correct logical steps in all cases.
I love this. I think you could simplify it by generalizing. Something like immutability. These keys can’t be changed, only an object destroyed and another created. A case of that is a primary key (maybe that’s the only case).
You can always represent a change as a removal and an addition. It’s smart to actually consider when you should. “Never” and “whenever possible” don’t seem like the best answers.
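Here is a small Python sketch of the "immutable keys" idea for a single pair of objects, assuming string keys. `diff_object` and its `immutable_keys` parameter are invented for illustration. If an immutable key differs, the only allowed operations are remove plus add; otherwise the diff is a list of field changes.

```python
def diff_object(old, new, immutable_keys=frozenset({"id"})):
    """Diff two dicts under an operation set where some keys cannot change."""
    if any(old.get(k) != new.get(k) for k in immutable_keys):
        # An immutable key differs: destroy the old object, create the new.
        return [("remove", old), ("add", new)]
    ops = []
    for k in sorted(old.keys() | new.keys()):
        if old.get(k) != new.get(k):
            # Simplification: a missing key is treated like a None value.
            ops.append(("set", k, old.get(k), new.get(k)))
    return ops
```

Passing `immutable_keys=frozenset()` gives the "never split" behavior, where an id going from 1 to 2 is just `("set", "id", 1, 2)`; the default gives removal plus addition for the same input.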
I implemented a tree-based diff for a JSON superset: https://github.com/gritzko/go-rdx
It boils down to the single-dimensional case, much as a JSON or DOM tree is represented as linear text.
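To illustrate the general idea (this is not how go-rdx does it, just a sketch of the reduction): flatten the tree into a token sequence with explicit open/close markers, then run an ordinary sequence diff over it. The token format and the names `flatten` and `linear_diff` are assumptions for this example.

```python
import difflib

def flatten(node):
    """Serialize a JSON-like tree into a flat token stream."""
    if isinstance(node, dict):
        yield "{"
        for key in sorted(node):  # assumes string keys
            yield f"key:{key}"
            yield from flatten(node[key])
        yield "}"
    elif isinstance(node, list):
        yield "["
        for item in node:
            yield from flatten(item)
        yield "]"
    else:
        yield f"val:{node!r}"

def linear_diff(old, new):
    """Diff two trees as flat token sequences; return non-equal opcodes."""
    a, b = list(flatten(old)), list(flatten(new))
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]
```

For example, `linear_diff({"a": [1, 2]}, {"a": [1, 3]})` returns `[("replace", ["val:2"], ["val:3"])]`. A plain sequence diff like this can produce edits with unbalanced brackets, so a real tool needs extra care to keep the result a valid tree.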
Can you explain why the `p` example is unmergeable whereas the `b` one isn't? I can't see any difference between the two examples other than the tag used.
The second is two bold letters, one after another in a single word.
However, if the HTML is "an application" more than it is "a document", a b-tag with two letters might be meaningfully different from two b-tags in sequence (for example, with CSS such as `b { display: block }`).
So, I'd say that as a fragment two bold tags might be mergeable, but not in the general case?
Edit: i.e., if diffing input from an HTML input field (rich editor), merging bold tags would probably be what you want when the first edit bolds the first letter and the second edit bolds the second letter.
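For the rich-editor case, one way to get that behavior is to normalize before diffing. Below is a rough Python sketch over a flat list of `(tag, text)` siblings rather than a real HTML parser. The `INLINE_MERGEABLE` set and the `normalize` name are assumptions, and the sketch ignores attributes and CSS entirely.

```python
# Assumption: these tags are safe to merge when adjacent. This is a policy
# choice; CSS like `b { display: block }` would make it wrong.
INLINE_MERGEABLE = {"b", "i", "u"}

def normalize(nodes):
    """Merge adjacent siblings that share the same mergeable inline tag."""
    merged = []
    for tag, text in nodes:
        if merged and tag in INLINE_MERGEABLE and merged[-1][0] == tag:
            # Same inline tag as the previous sibling: concatenate the text.
            merged[-1] = (tag, merged[-1][1] + text)
        else:
            merged.append((tag, text))
    return merged
```

Here `normalize([("b", "x"), ("b", "y")])` yields `[("b", "xy")]`, while `normalize([("p", "x"), ("p", "y")])` leaves the two paragraphs untouched.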
Regarding your comment about the notion of semantic HTML: I think HTML lacks an easily found ground-up course or series of articles or whatnot about how you get there.
MDN talks about when a specific element is appropriate, but it doesn’t really help you discover those elements that might be relevant.