I worked on this problem for just over a decade before moving onto other things. It’s a tough problem to solve.
The biggest problem I ran into is that the largest segment of user growth were too fickle. They wanted all kinds of magic in new optional features for their personal preferences that took incredible effort. I lacked the analytics to see who used which exotic features. Most of these people just wanted a beautifier more that a diff tool and would drop you in a heartbeat for more popular tools that wouldn’t do what they wanted but were popular.
The tool I wrote did have a strong following mostly around markup language parsing that was not at all exotic but solved problems other tools refused to approach.
My guidance is don’t become a code beautifier. In the languages I was supporting during the time frame I was supporting this code beautifiers were all the rage. Nobody seemed to want a diff tool with extra capabilities. Stick to being a diff tool. The people that are intentionally looking for intelligent diff tools tend to be more engineering focused and make for a loyal audience. People looking for code vanity are just the same as window shoppers walking down a street.
Thanks, this is good advice. There are some super featureful diff tools out there. For example, https://github.com/dandavison/delta does a line-based diff but it also syntax highlights its output.
I'm hoping that defining syntax in a separate TOML file will let end users extend difftastic for their own languages/config files. I want to keep difftastic small and manageable.
Those small things improved development and reviewing a lot for me!
If stuff moved around or got it's indentation changed I either add `--color-moved` and/or `-w` (ignore whitespace changes) flags to filter out extra noise.
Sometimes I need to use another regex though, e.g. a simple dot `.` for match all with no greedy +
Been hoping for more of this for years. We stare at diffs all day yet we have to accommodate the computer by understanding that the parenthesis it claims was changed wasn’t actually changed, there was just another set of parentheses added. There’s of course limits to how much a diff tool can extract meaning from two pieces of content, but structure and perhaps even heuristics like “new function was added here, maybe the curly brace belongs with that and not the old function” would certainly help.
This is cool. I work with a large Python codebase that's been around for a while (hasn't been that long since it's been completely off Python 2). It's not a bad codebase, but naturally it has almost no typing annotations, and other devs are just starting to come around to the idea.
I love `mypy --strict` but it produces way too much noise with all the other code, so I've been using a tool I made that runs mypy over the changed files and then greps the output to filter for lines that `git diff` points out have changed. It's pretty rough and imperfect (doesn't catch errors that started appearing in unchanged files, or unchanged lines it the same file, for instance) but it's still quite helpful. I've been meaning to make an improved version that runs mypy over the branch point, then over my branch, and then maps new and changed lines between them so it displays all errors that are new, but I haven't gotten around to it yet. It'd be useful for other tools too, like semgrep.
I'd forgotten about this, but years ago I was working in a huge PHP nightmare codebase.
So, I hacked together a pre-commit hook that blocked the commit only if the configured style checker registered errors on lines being added by the diff.
It never got very polished, but I wound up using it in two codebases over the years.
To ease the pain in conventional differs, we use a pre-commit hook to format the source code (prettier). This way we only see differences if something _actually_ changed.
I think code formatting should be mandatory and one of the first things you adopt in your project. Resist code style rule changes as much as possible, and if you do, apply them across the whole codebase in one go to avoid churn and noise in diffs down the line.
And if you do make style changes, put them in a separate commit at the very least so the diffs are cleaner and code reviews are easier.
In my project I use gofmt (goimports) for back-end code and prettier for front-end; I've configured my editor to apply those on save, and a pre-commit hook to either run the formatter, or error if the formatting is not according to the spec.
One of Go's proverbs is "Gofmt's style is no one's favorite, yet gofmt is everyone's favorite.". Consistency and low noise is more important (in that case) than a specific code style preference.
gofmt is in a small set of formatters that disallows configuration and choice, it's also the first to my recollection. This is a feature because deciding on a coding standard is bike shedding. It's also extremely aggressive, and undoes pretty much any choice you may make in formatting your code; a feature, again.
Not all languages are this fortunate. Some have configuration (cargo fmt), others aren't aggressive enough (Roslyn).
When I was still formatting code manually (... -ish; emacs did a lot of the tedious work for me) I eventually settled on a style very similar to your four-argument example, precisely because it makes diffs easier to read.
For the same reason, I asciibetically sorted things when it made sense.
Now I use prettier and black and I'm mostly satisfied by them, but their reflow behavior puts the lie to "[black] makes code review faster by producing the smallest diffs possible."
This seems like a really difficult issue to solve in the general case, but I found that solving it for the specific case I had was a tractable problem.
I have a need to diff the output of a query and compare it to the last time it ran, to do regression testing. Just diffing the resulting CSVs wasn't very useful, because I needed the ability to do things like ignore new columns, and report the exact column that had differences from the previous version.
I was able to do that by defining a primary key on which I could outer join the two tables. Missing or new rows would be the ones that didn't join, and then I could do a per-column comparison for each row that did join.
It's a great idea, but I don't think defining the syntax of a programming language as a syntax.toml file will work for enough programming languages for this to be useful. You're basically rewriting the parser of your language in a DSL that isn't as expressive as the language the parser is written in.
I think you'd need another parser/syntax interface for this to work. E.g. running a binary that you can submit source code to which responds with a JSON file containing the parsed tokens. That way you can reuse the compiler's parser.
Yeah, so it's basically a lexer with an extremely simplistic parser.
Compiler parsers aren't a great fit for difftastic. They discard comments, they may not give you output if there are syntax errors, and they're usually tied to a specific language version.
FWIW, as soon as I saw this project, I wondered specifically about Treesitter's applicability to this problem, and I found https://github.com/afnanenayet/diffsitter.
I think some of the issues are alleviated because tree sitter was made with text editors in mind, so you're not really getting a compiler parser as editors still care about comments and whatnot. It's also fast, which was a big motivation to use it.
Difftastic does not create a patch or worry about merging. That's a hard problem that I'm not trying to solve. Instead, it builds two ASTs, then marks each node as unchanged or novel.
To compute the diff, I use a graph search. Each vertex represents a position in both the left and right ASTs.
Suppose you're comparing A with X A.
Start node:
Left: A Right: X A
^ ^
The possible next nodes are:
(1) Treat A on the left as novel.
Left: A Right: X A
^ ^
(2) Treat X on the right as novel.
Left: A Right: X A
^ ^
Both (1) and (2) are the same 'distance', but (2) is closer to the end node, because there's a edge from (2) to the end that marks A as unchanged.
I've implemented this using Dijkstra's algorithm. My graph is directed and acyclic, so there are faster algorithms like topological sort. However, I don't construct the whole graph in advance (that would take O(N^2) memory) so instead I construct the graph nodes as necessary.
(This is very similar to Autochrome, which I've linked in the README. Autochrome has a worked example which is really helpful.)
At some point I think I'll have to use A* search instead. If there are more than 500 lines of code with lots of changes, difftastic can take a few seconds to terminate due to the naive graph search.
Thanks for the reply Wilfred! I was not familiar with Autochrome, I will certainly have a look!
That's interesting, I like the idea of not worrying about patching nor merging, giving you a tool that is focused on "communicating the differences to a human", and indeed, it means you don't have to worry about a whole bag of problems.
One insight that I came across (more info on Chap 5 of my thesis) is that not considering or handling duplication means you incur a quadratic slowdown in your search algorithm. For example, say you're diffing `A` against `Bin A A`. If you can't understand that `A` was duplicated, which `A` do you copy? You have to evaluate both options even though it really doesn't matter which one you copy.
One good middle ground for speeding up your algorithm while not having to worry about displaying duplications is to have an intermediate step where first you diff with duplication detection, but then you just go over the result and make arbitrary choices about which duplicate to copy and which to insert/delete.
I think the diffing is the “obvious” graph search algorithm between trees, where a “tree” is a list of atoms or trees (think lisp lists).
Basically to diff a tree of n top-level elements against one of m elements, construct a graph where nodes lie on an (n+1)x(m+1) grid. Each node (a,b) corresponds to having looked at a elements of the first and matched them to b elements of the second list. Add edges (a,b)->(a+1,b) for deletion; (a,b)->(a,b+1) for insertion; and (a,b)->(a+1,b+1) for an inner diff (ie basically this graph search problem again). Choose weights to apply to node and now find the shortest path from (0,0) to (n,m).
From you description it seems like we're just computing the standard insert-delete tree-edit-distance. These tend to be slow.
This implies that the patch language only supports insertion, deletion and modification of nodes, which is a shame since refactorings, moves and duplications are also common operations in the source-code domain. Additionally, if the patch language only supports insertion, deletion and modification, the merging algorithm will perform poorly.
Yep, that's a fair description. Note that I'm not providing a merge algorithm, just a pretty way of viewing changes.
I did look at modelling moves in an earlier prototype, but it's incredibly hard to display the result in a coherent way when there are also insertions. It was also easier to drop it when I moved to Dijkstra.
As you can see in the screenshot in the readme, it does support inserting tree nodes whilst preserving children, which covers a ton of cases.
My favorite diff tool is the one shipped with Plastic SCM, Xdiff. Since it's visual, it makes it very easy to see what changes have been done to the file.
And theirs Git UI, gmaster, includes the same diff tech too.
I do not use it for everyday tasks, but when I need to understand complex and/or big changes this is my go-to tool.
Never really worked all that well. I looked for research on how to diff something like that but didn't find anything useful. IIRC the diff works by "serializing" the cell grid, effectively treating each cell as a separate "line" and then running that through a conventional line-based diff algorithm.
I'd feel so much more motivation for checking out alternative diff tools if there was a better story for integrating them with the review tools in Github, GitLab, etc. I know there's nothing anyone can do about that— it's something the Git hosts themselves have to enable, or I have to see enough benefit in it to go to an dedicated review tool to make the bother of that worthwhile.
I believe Gerrit has a pluggable diff— is there anything more broadly on improving this story?
I still look at diffs in the terminal pretty often, but all my code reviews are in rendered HTML.
That said, there needs to be a credible tool before review tools can adopt it! GitHub does a line-based diff with word-based highlighting, which is probably the best you can do without syntactic smarts.
It would be neat if there was a way to supply a "diff hint" or something right in your git commit metadata. Obviously the receiver/reviewer/renderer can ultimately do whatever they want, but it would helpful if I as the one preparing the change could at least specify intent.
I guess projects like the kernel where the review system is built around emailed patches kind of already get this for free— once committed, the change will be rendered according to the local user's git settings, but during the review itself, it will be a diff prepared by the change's author that will be under discussion.
In a glorious future where GitLab has four different diff options, it would be great if I could specify that I want it to default to the hinted diff tool, falling back to my preferred one if there is no hint.
I agree sliders are a problem, and I hope to have a solution there.
Syntactic differs already do better because they understand that parentheses/brackets are paired. Difftastic does OK with this example: https://imgur.com/a/pVlVBo5
FWIW, the formatting of the snippets I wrote above was as two separate diffs for additions with the new additions (ie green parts) represented with [square brackets].
I wrote diffr [0] for that purpose; it serves me well, especially if your team makes code with long lines.
In my opinion, a simple approach that does NOT make any parsing is more efficient (what about bugs in your parser? code with syntax errors? also, how fast would the parser be?)
Tree-sitter is great, but I find it could do a better job with broken code. This is particularly important when parsing things like C or C++ where the preprocessor makes it likely that unpreprocessed code can't be parsed anyway.
syntax-aware semantic history would be incredibly useful for code review and codebase archaeology. better detection of global rename, refactorings, and moves would make CR diffs way less messy.
code review is a necessary but painful part of releasing good code on a team; anything that makes it slightly easier is a force multiplier for companies whose main bottleneck is software
Yep! There's a lot of .unwrap() and .expect() in the codebase, so it panics. Since it's Rust, you get a line number and an error message rather than a segfault.
I will tidy it up at some point, but I spend too much time throwing away ideas that don't work. Defensive code is silly if you delete it after!
The biggest problem I ran into is that the largest segment of user growth were too fickle. They wanted all kinds of magic in new optional features for their personal preferences that took incredible effort. I lacked the analytics to see who used which exotic features. Most of these people just wanted a beautifier more that a diff tool and would drop you in a heartbeat for more popular tools that wouldn’t do what they wanted but were popular.
The tool I wrote did have a strong following mostly around markup language parsing that was not at all exotic but solved problems other tools refused to approach.
My guidance is don’t become a code beautifier. In the languages I was supporting during the time frame I was supporting this code beautifiers were all the rage. Nobody seemed to want a diff tool with extra capabilities. Stick to being a diff tool. The people that are intentionally looking for intelligent diff tools tend to be more engineering focused and make for a loyal audience. People looking for code vanity are just the same as window shoppers walking down a street.