Difftastic: Syntax-aware structured diff tool

austincheney · on July 8, 2021

I worked on this problem for just over a decade before moving onto other things. It’s a tough problem to solve.

The biggest problem I ran into is that the largest segment of user growth was too fickle. They wanted all kinds of magic in new optional features for their personal preferences that took incredible effort. I lacked the analytics to see who used which exotic features. Most of these people just wanted a beautifier more than a diff tool and would drop you in a heartbeat for more popular tools that wouldn’t do what they wanted but were popular.

The tool I wrote did have a strong following mostly around markup language parsing that was not at all exotic but solved problems other tools refused to approach.

My guidance is don’t become a code beautifier. In the languages I was supporting, during the time frame I was supporting this, code beautifiers were all the rage. Nobody seemed to want a diff tool with extra capabilities. Stick to being a diff tool. The people that are intentionally looking for intelligent diff tools tend to be more engineering focused and make for a loyal audience. People looking for code vanity are just the same as window shoppers walking down a street.

Wilfred · on July 8, 2021

Thanks, this is good advice. There are some super featureful diff tools out there. For example, https://github.com/dandavison/delta does a line-based diff but it also syntax highlights its output.

I'm hoping that defining syntax in a separate TOML file will let end users extend difftastic for their own languages/config files. I want to keep difftastic small and manageable.

pdimitar · on July 8, 2021

Can you give us a link to your tool?

hinoki · on July 8, 2021

The GP’s first submission to HN was a link to a diffing tool, so I assume it’s this: https://prettydiff.com/

tlamponi · on July 8, 2021

Surely far from being as elaborate as the linked tool, but I use the following git command a few dozen times daily:

  git diff --word-diff=color --word-diff-regex='\w+'

I added two aliases to my .gitconfig, one for diff and one for show:

  [alias]
    word-show = show --word-diff=color --word-diff-regex='\\w+'
    word-diff = diff --word-diff=color --word-diff-regex='\\w+'

Those small things improved development and reviewing a lot for me!

If stuff moved around or got its indentation changed I either add `--color-moved` and/or `-w` (ignore whitespace changes) flags to filter out extra noise.

Sometimes I need to use another regex though, e.g. a simple dot `.` for match all with no greedy +

fragmede · on July 8, 2021

A tiny bit of shell golf - I have similar aliased to wdiff (or even wd if you're especially parsimonious).

blixt · on July 8, 2021

Been hoping for more of this for years. We stare at diffs all day yet we have to accommodate the computer by understanding that the parenthesis it claims was changed wasn’t actually changed, there was just another set of parentheses added. There’s of course limits to how much a diff tool can extract meaning from two pieces of content, but structure and perhaps even heuristics like “new function was added here, maybe the curly brace belongs with that and not the old function” would certainly help.

vcmiraldo · on July 8, 2021

you'll probably enjoy the patience diff: https://blog.jcoglan.com/2017/09/19/the-patience-diff-algori...

pedrovhb · on July 8, 2021

This is cool. I work with a large Python codebase that's been around for a while (hasn't been that long since it's been completely off Python 2). It's not a bad codebase, but naturally it has almost no typing annotations, and other devs are just starting to come around to the idea.

I love `mypy --strict` but it produces way too much noise with all the other code, so I've been using a tool I made that runs mypy over the changed files and then greps the output to filter for lines that `git diff` points out have changed. It's pretty rough and imperfect (doesn't catch errors that started appearing in unchanged files, or unchanged lines in the same file, for instance) but it's still quite helpful. I've been meaning to make an improved version that runs mypy over the branch point, then over my branch, and then maps new and changed lines between them so it displays all errors that are new, but I haven't gotten around to it yet. It'd be useful for other tools too, like semgrep.
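
A rough sketch of that filtering step in Python, for illustration only; it is not the tool described above. It assumes a unified diff (as `git diff` prints) and mypy's default `path:line: severity: message` output. The names `changed_lines` and `filter_mypy` are made up, and the parser is deliberately naive: for instance, an added line whose content starts with `++ ` would be mistaken for a file header.

```python
import re

# Hunk header, e.g. "@@ -10,4 +12,6 @@"; group 1 is the first new-file line.
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")
# Start of a mypy message, e.g. "pkg/mod.py:42: error: ..."
MYPY_LINE = re.compile(r"^(?P<path>[^:]+):(?P<line>\d+):")


def changed_lines(diff_text):
    """Map each file in a unified diff to the set of added line numbers."""
    changed = {}
    path, lineno = None, 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:]
            if path.startswith("b/"):
                path = path[2:]
            changed.setdefault(path, set())
        elif line.startswith("@@"):
            match = HUNK.match(line)
            if match:
                lineno = int(match.group(1))
        elif path is None:
            continue
        elif line.startswith("+"):
            changed[path].add(lineno)
            lineno += 1
        elif line.startswith(" "):
            lineno += 1  # context line: present in the new file, not changed
        # Removed lines ("-...") do not advance the new-file line counter.
    return changed


def filter_mypy(mypy_output, changed):
    """Keep only the mypy messages that point at an added line."""
    kept = []
    for line in mypy_output.splitlines():
        match = MYPY_LINE.match(line)
        if match and int(match.group("line")) in changed.get(match.group("path"), set()):
            kept.append(line)
    return kept
```

As the comment says, this only catches errors reported on touched lines; errors that a change causes elsewhere are still missed.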

NateEag · on July 8, 2021

I'd forgotten about this, but years ago I was working in a huge PHP nightmare codebase.

So, I hacked together a pre-commit hook that blocked the commit only if the configured style checker registered errors on lines being added by the diff.

It never got very polished, but I wound up using it in two codebases over the years.

https://github.com/NateEag/diff-check

zomglings · on July 8, 2021

This sounds well in line with what I have been building - a tool to take syntax-aware diffs across git commits: https://github.com/bugout-dev/locust

It currently supports Python, Javascript, and Java. I like the idea of mypy changes, as well.

pedrovhb · on July 8, 2021

I'll have a look, thanks!

davidkunz · on July 8, 2021

To ease the pain in conventional differs, we use a pre-commit hook to format the source code (prettier). This way we only see differences if something _actually_ changed.

Cthulhu_ · on July 8, 2021

I think code formatting should be mandatory and one of the first things you adopt in your project. Resist code style rule changes as much as possible, and if you do, apply them across the whole codebase in one go to avoid churn and noise in diffs down the line.

And if you do make style changes, put them in a separate commit at the very least so the diffs are cleaner and code reviews are easier.

In my project I use gofmt (goimports) for back-end code and prettier for front-end; I've configured my editor to apply those on save, and a pre-commit hook to either run the formatter, or error if the formatting is not according to the spec.

One of Go's proverbs is "Gofmt's style is no one's favorite, yet gofmt is everyone's favorite.". Consistency and low noise is more important (in that case) than a specific code style preference.

martijnarts · on July 8, 2021

And if you do decide to do single-commit massive style changes, add the commits to an ignore revs file: http://git-scm.com/docs/git-config#Documentation/git-config....

lpapez · on July 8, 2021

Did not know that thing existed, super useful. Thanks a ton.

zamalek · on July 8, 2021

> gofmt

gofmt is in a small set of formatters that disallows configuration and choice, it's also the first to my recollection. This is a feature because deciding on a coding standard is bike shedding. It's also extremely aggressive, and undoes pretty much any choice you may make in formatting your code; a feature, again.

Not all languages are this fortunate. Some have configuration (cargo fmt), others aren't aggressive enough (Roslyn).

llimllib · on July 8, 2021

you might want to check out gofumpt too, if you haven't already: https://github.com/mvdan/gofumpt

Wilfred · on July 8, 2021

A syntactic differ like Difftastic is very helpful when your codebase is autoformatted. Formatters often reflow code.

Given the code:

  foo(one, two, three);

If you add an argument and reformat:

  foo(
    one,
    two,
    new,
    three
  );

A line-based diff can make it hard to spot what's changed.
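
To make the contrast concrete, here is a small Python sketch using the standard library's `difflib`. It is not how Difftastic works internally: the `tokens` helper is a hypothetical, very crude lexer (identifiers and single punctuation characters), used only to show that comparing tokens instead of lines isolates the new argument.

```python
import difflib
import re

before = "foo(one, two, three);"
after = "foo(\n  one,\n  two,\n  new,\n  three\n);"


def tokens(src):
    # Identifiers/numbers, or single punctuation characters; whitespace is dropped.
    return re.findall(r"\w+|[^\w\s]", src)


def changed_line_count(old, new):
    """Number of lines a line-based comparison reports as removed or added."""
    a, b = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def novel_tokens(old, new):
    """Tokens that appear only on the new side of a token-level comparison."""
    a, b = tokens(old), tokens(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(b[j1:j2])
    return added


print(changed_line_count(before, after))  # every line counts as changed
print(novel_tokens(before, after))        # only the new argument and its comma
```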

maw · on July 8, 2021

When I was still formatting code manually (... -ish; emacs did a lot of the tedious work for me) I eventually settled on a style very similar to your four-argument example, precisely because it makes diffs easier to read.

For the same reason, I asciibetically sorted things when it made sense.

Now I use prettier and black and I'm mostly satisfied by them, but their reflow behavior puts the lie to "[black] makes code review faster by producing the smallest diffs possible."

danuker · on July 8, 2021

Prettier works for JS.

A similar Python tool is Black: https://github.com/psf/black

> Black makes code review faster by producing the smallest diffs possible.

conceptme · on July 8, 2021

This worsens the problem especially in templates when the nesting changes.

frafra · on July 8, 2021

Good diff tools will only show you that the indentation changed, not the line as a whole (Meld for example).

philipov · on July 8, 2021

This seems like a really difficult issue to solve in the general case, but I found that solving it for the specific case I had was a tractable problem.

I have a need to diff the output of a query and compare it to the last time it ran, to do regression testing. Just diffing the resulting CSVs wasn't very useful, because I needed the ability to do things like ignore new columns, and report the exact column that had differences from the previous version.

I was able to do that by defining a primary key on which I could outer join the two tables. Missing or new rows would be the ones that didn't join, and then I could do a per-column comparison for each row that did join.
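
A minimal Python sketch of that keyed comparison, assuming both results are CSV text with a header row and a unique key column. The function names and the `ignore` parameter are made up for illustration; this is not the commenter's actual tool.

```python
import csv
import io


def load(csv_text, key):
    """Index the rows of a CSV document by the value of its key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def keyed_diff(old_csv, new_csv, key, ignore=()):
    """Return (removed_keys, added_keys, changed_cells) between two CSV results."""
    old, new = load(old_csv, key), load(new_csv, key)
    # Rows that fail to join on the key are the missing or new rows.
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    changed = []
    for k in sorted(old.keys() & new.keys()):
        # Compare only columns present in both runs, so new columns are ignored.
        shared = (old[k].keys() & new[k].keys()) - set(ignore)
        for col in sorted(shared):
            if old[k][col] != new[k][col]:
                changed.append((k, col, old[k][col], new[k][col]))
    return removed, added, changed
```

Each entry in the third list names the key, the column, and the old and new values, which gives the per-column report described above.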

runeks · on July 8, 2021

It's a great idea, but I don't think defining the syntax of a programming language as a syntax.toml file will work for enough programming languages for this to be useful. You're basically rewriting the parser of your language in a DSL that isn't as expressive as the language the parser is written in.

I think you'd need another parser/syntax interface for this to work. E.g. running a binary that you can submit source code to which responds with a JSON file containing the parsed tokens. That way you can reuse the compiler's parser.

Wilfred · on July 8, 2021

Yeah, so it's basically a lexer with an extremely simplistic parser.

Compiler parsers aren't a great fit for difftastic. They discard comments, they may not give you output if there are syntax errors, and they're usually tied to a specific language version.

Since this format works well for Comby (Rijnard's talk is excellent: https://www.youtube.com/watch?v=JMZLBB_BFNg ) I'm hopeful it's an adequate solution for Difftastic.

It will also allow users to add their own custom syntax/config formats.

That said, using tree-sitter might be an option. It's more forgiving than compiler parsers.

zhengyi13 · on July 8, 2021

FWIW, as soon as I saw this project, I wondered specifically about Treesitter's applicability to this problem, and I found https://github.com/afnanenayet/diffsitter.

Maybe there's something to be learned there?

affyboi · on July 12, 2021

Author of diffsitter here

I think some of the issues are alleviated because tree sitter was made with text editors in mind, so you're not really getting a compiler parser as editors still care about comments and whatnot. It's also fast, which was a big motivation to use it.

zokier · on July 8, 2021

Gumtree diff is based on ast: https://github.com/GumTreeDiff/gumtree

There is some discussion of using Treesitter for parsing, that would potentially open door for many languages: https://github.com/GumTreeDiff/gumtree/issues/148

vcmiraldo · on July 8, 2021

That is a really difficult problem for more reasons than what fits in this comment :) In fact, I got my PhD studying this very problem (https://victorcmiraldo.github.io/data/MiraldoPhD.pdf).

I did not find any description of how your diffing algorithm works nor how you represent a patch. I'd be really curious to know more.

Wilfred · on July 8, 2021

Wow, thank you for the pointer! I've added it to https://github.com/Wilfred/difftastic/wiki/Structural-Diffs as I'm trying to understand the other solutions in this space.

Difftastic does not create a patch or worry about merging. That's a hard problem that I'm not trying to solve. Instead, it builds two ASTs, then marks each node as unchanged or novel.

To compute the diff, I use a graph search. Each vertex represents a position in both the left and right ASTs.

Suppose you're comparing A with X A.

Start node:

  Left: A   Right: X A
        ^          ^

The possible next nodes are:

(1) Treat A on the left as novel.

  Left: A   Right: X A
         ^         ^

(2) Treat X on the right as novel.

  Left: A   Right: X A
        ^            ^

Both (1) and (2) are the same 'distance', but (2) is closer to the end node, because there's an edge from (2) to the end that marks A as unchanged.

I've implemented this using Dijkstra's algorithm. My graph is directed and acyclic, so there are faster algorithms like topological sort. However, I don't construct the whole graph in advance (that would take O(N^2) memory) so instead I construct the graph nodes as necessary.

(This is very similar to Autochrome, which I've linked in the README. Autochrome has a worked example which is really helpful.)

At some point I think I'll have to use A* search instead. If there are more than 500 lines of code with lots of changes, difftastic can take a few seconds to terminate due to the naive graph search.
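
For readers who want to experiment, here is a simplified Python sketch of the search described above. It is not Difftastic's code: it works on flat lists rather than ASTs, and the edge costs (0 for an unchanged item, 1 for a novel one) are an assumption. What it does share with the description is that vertices are pairs of positions and neighbours are generated on demand rather than building the whole graph up front.

```python
import heapq


def diff_sequences(left, right):
    """Dijkstra over (left position, right position) vertices.

    Returns a list of (label, item) pairs, where label is one of
    "unchanged", "novel-left" or "novel-right".
    """
    start, goal = (0, 0), (len(left), len(right))
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        cost, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            break
        if cost > dist[(i, j)]:
            continue  # stale heap entry
        # Neighbours are computed here, only for vertices we actually visit.
        steps = []
        if i < len(left) and j < len(right) and left[i] == right[j]:
            steps.append(((i + 1, j + 1), 0, ("unchanged", left[i])))
        if i < len(left):
            steps.append(((i + 1, j), 1, ("novel-left", left[i])))
        if j < len(right):
            steps.append(((i, j + 1), 1, ("novel-right", right[j])))
        for node, weight, label in steps:
            new_cost = cost + weight
            if new_cost < dist.get(node, float("inf")):
                dist[node] = new_cost
                prev[node] = ((i, j), label)
                heapq.heappush(heap, (new_cost, node))
    # Walk back from the goal to recover the labelled path.
    path = []
    node = goal
    while node != start:
        node, label = prev[node]
        path.append(label)
    path.reverse()
    return path


print(diff_sequences(["A"], ["X", "A"]))
```

On the `A` versus `X A` example this takes option (2): X is marked novel on the right and A is matched as unchanged.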

vcmiraldo · on July 9, 2021

Thanks for the reply Wilfred! I was not familiar with Autochrome, I will certainly have a look!

That's interesting, I like the idea of not worrying about patching nor merging, giving you a tool that is focused on "communicating the differences to a human", and indeed, it means you don't have to worry about a whole bag of problems.

One insight that I came across (more info on Chap 5 of my thesis) is that not considering or handling duplication means you incur a quadratic slowdown in your search algorithm. For example, say you're diffing `A` against `Bin A A`. If you can't understand that `A` was duplicated, which `A` do you copy? You have to evaluate both options even though it really doesn't matter which one you copy.

One good middle ground for speeding up your algorithm while not having to worry about displaying duplications is to have an intermediate step where first you diff with duplication detection, but then you just go over the result and make arbitrary choices about which duplicate to copy and which to insert/delete.

dan-robertson · on July 8, 2021

I think the diffing is the “obvious” graph search algorithm between trees, where a “tree” is a list of atoms or trees (think lisp lists).

Basically, to diff a tree of n top-level elements against one of m elements, construct a graph where nodes lie on an (n+1)x(m+1) grid. Each node (a,b) corresponds to having looked at a elements of the first list and matched them to b elements of the second list. Add edges (a,b)->(a+1,b) for deletion; (a,b)->(a,b+1) for insertion; and (a,b)->(a+1,b+1) for an inner diff (i.e. basically this graph search problem again). Choose weights to apply to each node, then find the shortest path from (0,0) to (n,m).
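
A small Python sketch of that grid, under some assumptions of my own: trees are nested tuples of strings, deleting or inserting a subtree costs its node count, and matching two lists costs whatever their inner diff costs. Because the grid is a DAG, the shortest path is computed here with a plain dynamic program instead of a graph search. The names `size`, `pair_cost` and `distance` are illustrative.

```python
from functools import lru_cache


def size(node):
    """Number of nodes in a tree (an atom counts as 1)."""
    if isinstance(node, tuple):
        return 1 + sum(size(child) for child in node)
    return 1


def pair_cost(x, y):
    """Cost of the diagonal edge that pairs element x with element y."""
    if isinstance(x, tuple) and isinstance(y, tuple):
        return distance(x, y)  # inner diff: the same problem one level down
    if x == y:
        return 0
    return size(x) + size(y)  # treat as delete x plus insert y


@lru_cache(maxsize=None)
def distance(a, b):
    """Cheapest path from (0, 0) to (n, m) on the (n+1) x (m+1) grid."""
    n, m = len(a), len(b)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:  # edge (i-1, j) -> (i, j): deletion
                options.append(best[i - 1][j] + size(a[i - 1]))
            if j > 0:  # edge (i, j-1) -> (i, j): insertion
                options.append(best[i][j - 1] + size(b[j - 1]))
            if i > 0 and j > 0:  # edge (i-1, j-1) -> (i, j): match or inner diff
                options.append(best[i - 1][j - 1] + pair_cost(a[i - 1], b[j - 1]))
            best[i][j] = min(options)
    return best[n][m]


print(distance(("f", ("a", "b")), ("f", ("a", "x", "b"))))
```

The example pairs the two inner lists and charges only for the inserted `x`.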

vcmiraldo · on July 8, 2021

From your description it seems like we're just computing the standard insert-delete tree-edit-distance. These tend to be slow.

This implies that the patch language only supports insertion, deletion and modification of nodes, which is a shame since refactorings, moves and duplications are also common operations in the source-code domain. Additionally, if the patch language only supports insertion, deletion and modification, the merging algorithm will perform poorly.

Wilfred · on July 8, 2021

Yep, that's a fair description. Note that I'm not providing a merge algorithm, just a pretty way of viewing changes.

I did look at modelling moves in an earlier prototype, but it's incredibly hard to display the result in a coherent way when there are also insertions. It was also easier to drop it when I moved to Dijkstra.

As you can see in the screenshot in the readme, it does support inserting tree nodes whilst preserving children, which covers a ton of cases.

Audiophilip · on July 8, 2021

My favorite diff tool is the one shipped with Plastic SCM, Xdiff. Since it's visual, it makes it very easy to see what changes have been done to the file.

https://www.plasticscm.com/features/xmerge

omgtehlion · on July 8, 2021

And their Git UI, gmaster, includes the same diff tech too. I do not use it for everyday tasks, but when I need to understand complex and/or big changes this is my go-to tool.

Kinrany · on July 8, 2021

Reminds me of Comby [1]: it's language-agnostic and relies on various brackets to make search-and-replace more structured.

Edit: ah, of course, Comby is referenced in the Readme.

[1]: https://comby.dev/

foreigner · on July 8, 2021

I built a diff tool for spreadsheets years ago: https://support.smartbear.com/collaborator/docs/working-with...

Never really worked all that well. I looked for research on how to diff something like that but didn't find anything useful. IIRC the diff works by "serializing" the cell grid, effectively treating each cell as a separate "line" and then running that through a conventional line-based diff algorithm.

mikepurvis · on July 8, 2021

I'd feel so much more motivation for checking out alternative diff tools if there was a better story for integrating them with the review tools in GitHub, GitLab, etc. I know there's nothing anyone can do about that; it's something the Git hosts themselves have to enable, or I have to see enough benefit in it to go to a dedicated review tool to make the bother of that worthwhile.

I believe Gerrit has a pluggable diff— is there anything more broadly on improving this story?

Wilfred · on July 8, 2021

Definitely!

I still look at diffs in the terminal pretty often, but all my code reviews are in rendered HTML.

That said, there needs to be a credible tool before review tools can adopt it! GitHub does a line-based diff with word-based highlighting, which is probably the best you can do without syntactic smarts.

mikepurvis · on July 8, 2021

It would be neat if there was a way to supply a "diff hint" or something right in your git commit metadata. Obviously the receiver/reviewer/renderer can ultimately do whatever they want, but it would be helpful if I as the one preparing the change could at least specify intent.

I guess projects like the kernel where the review system is built around emailed patches kind of already get this for free— once committed, the change will be rendered according to the local user's git settings, but during the review itself, it will be a diff prepared by the change's author that will be under discussion.

In a glorious future where GitLab has four different diff options, it would be great if I could specify that I want it to default to the hinted diff tool, falling back to my preferred one if there is no hint.

dan-robertson · on July 8, 2021

I personally am much more excited by “sliders” than the structure-aware diffs. Marking additions between [], it is the difference between e.g.

  handle_case [some new
    case over multiple lines
  handle_case] some existing case
  
  [  check_invariant();
  }
  
  function newFunc(){
    ...
  ]  check_invariant();
  }

And

  [handle_case some
    new case]
  handle_case old case

  [function newFunc(){
    ...
    check_invariant();
  }]

Wilfred · on July 8, 2021

I agree sliders are a problem, and I hope to have a solution there.

Syntactic differs already do better because they understand that parentheses/brackets are paired. Difftastic does OK with this example: https://imgur.com/a/pVlVBo5

dan-robertson · on July 8, 2021

Yeah I’m keen to see your solution.

FWIW, the formatting of the snippets I wrote above was as two separate diffs for additions with the new additions (ie green parts) represented with [square brackets].

mookid11 · on July 8, 2021

I wrote diffr [0] for that purpose; it serves me well, especially if your team makes code with long lines.

In my opinion, a simple approach that does NOT do any parsing is more efficient (what about bugs in your parser? code with syntax errors? also, how fast would the parser be?)

[0]: https://github.com/mookid/diffr

feanaro · on July 8, 2021

Many of your concerns could be alleviated by using Tree-Sitter. (https://tree-sitter.github.io/tree-sitter/)

pfdietz · on July 8, 2021

Tree-sitter is great, but I find it could do a better job with broken code. This is particularly important when parsing things like C or C++ where the preprocessor makes it likely that unpreprocessed code can't be parsed anyway.

eproxus · on July 8, 2021

Couldn't something like this be based on e.g. Sublime's syntax definitions? Then it would work on all languages/formats that had such a definition.

awinter-py · on July 8, 2021

syntax-aware semantic history would be incredibly useful for code review and codebase archaeology. better detection of global rename, refactorings, and moves would make CR diffs way less messy.

code review is a necessary but painful part of releasing good code on a team; anything that makes it slightly easier is a force multiplier for companies whose main bottleneck is software

beermonster · on July 8, 2021

A good time to also point out https://tekin.co.uk/2020/10/better-git-diff-output-for-ruby-...

7sidedmarble · on July 9, 2021

Why the hell is that not just the default.

SeriousM · on July 8, 2021

That's awesome!

Much better than "guessed" syntax diff or even line diff.

jd115 · on July 8, 2021

What I really want to know is what can I use to diff XML files, semantically?

vbarta · on July 8, 2021

http://mangrove.cz/diffmark/ (full disclosure: I wrote that), for example...

beermonster · on July 8, 2021

I usually c14n them and then diff
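
One way to sketch that workflow in Python (3.8 or later), assuming the standard library's `xml.etree.ElementTree.canonicalize` is an acceptable canonicalizer for your files. Splitting the canonical text at `><` so a line diff has something to work with is my own shortcut, and `c14n_diff` is a made-up name.

```python
import difflib
from xml.etree.ElementTree import canonicalize


def canonical_lines(xml_text):
    """Canonicalize (C14N 2.0), drop layout whitespace, one tag boundary per line."""
    canon = canonicalize(xml_text, strip_text=True)
    return canon.replace("><", ">\n<").splitlines()


def c14n_diff(old_xml, new_xml):
    """Return the added/removed lines between the canonical forms."""
    diff = difflib.unified_diff(
        canonical_lines(old_xml), canonical_lines(new_xml), lineterm="", n=0
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


old = '<root b="2" a="1"><item>x</item></root>'
new = '<root a="1" b="2">\n  <item>y</item>\n</root>'
print(c14n_diff(old, new))
```

Attribute order and indentation differ between the two inputs, but only the changed `<item>` shows up in the result. Note that `strip_text=True` also trims leading and trailing whitespace inside text content, which may or may not be what you want.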

Wilfred · on July 8, 2021

Author here! Happy to answer any questions :)

bifftastic · on July 8, 2021

I like the name

secondcoming · on July 8, 2021

How can he have crashes if it's written in Rust?

dcminter · on July 8, 2021

Probably meaning 'panic' - e.g. unwrapping a Result without allowing for an error result.

Wilfred · on July 8, 2021

Yep! There's a lot of .unwrap() and .expect() in the codebase, so it panics. Since it's Rust, you get a line number and an error message rather than a segfault.

I will tidy it up at some point, but I spend too much time throwing away ideas that don't work. Defensive code is silly if you delete it after!