
Reading this article, I feel as though the author doesn't deeply understand git.

git works on blobs of data, not files, and not lines of text. It doesn't just happen to also work on binary files; binary data is all it works on.
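
To make the "blobs, not lines" point concrete, here is a minimal sketch of how a blob's object id is derived, assuming the classic SHA-1 object format (the function name `git_blob_id` is made up for illustration). The id depends only on the raw bytes, so text and binary content are treated identically:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 id of a blob: hash of 'blob <size>\\0' plus the bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# Text and binary payloads go through exactly the same code path.
text_id = git_blob_id(b"hello\n")
binary_id = git_blob_id(bytes([0, 255, 16, 32]))
empty_id = git_blob_id(b"")

print(text_id)
print(binary_id)
print(empty_id)
```

Nothing in the id computation looks at line endings, encodings, or file names; those only matter to tools layered on top.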

Now, if the author is suggesting that git-diff ought to have a language specific mode that parses changed files as ASTs to compare, now I'm interested. Let's do that. I'll help!
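
As a rough idea of what such a mode could do, here is a sketch that compares two versions of a Python file by syntax tree instead of by text. It uses Python's standard `ast` module; the helper name `same_syntax_tree` is hypothetical, and a real tool would have to report *what* changed, not just whether something did:

```python
import ast


def same_syntax_tree(old_source: str, new_source: str) -> bool:
    """True if both sources parse to the same AST (ignoring layout and comments)."""
    # ast.dump omits line/column attributes by default, so formatting is ignored.
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


old = "def area(w, h):\n    return w * h\n"
reformatted = "def area(w,h):\n    # width times height\n    return (w*h)\n"
changed = "def area(w, h):\n    return w + h\n"

print(same_syntax_tree(old, reformatted))  # whitespace, comment, parentheses only
print(same_syntax_tree(old, changed))      # operator actually changed
```

A line-based diff would flag every line of `reformatted` as changed, while the tree comparison sees no difference.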

But git does not need to change how it works for that to happen. Git does not even need git-diff to exist to serve its main purpose.



Note that git does work with diffs a lot.

Rebases and cherry-picks work by applying diffs, not by copying blobs. Auto-merging also needs to look at file content as text: you can't auto-merge a binary file with git.

It's an often repeated fact that if you look inside Git, it doesn't work with diffs, it works with blobs. But if you look closer, it's often diffs again!


With cherry-picks (and thus rebase), you ask git to turn a commit into a patch, so it does just that.

I would mostly consider auto merges (which I guess are bolted on) as the main area where git itself uses diffs during resolution and even then only as a suggested resolution (you get warned and need to confirm it when validating the merge).

So no, it's blobs all the way down. Darcs and Pijul are patch based though.


It's true that git is blob based, as opposed to patch based, but it's not the full picture! In practice, git stores a lot more diffs (or rather, deltas) than it stores loose blobs. (And you probably know this already, but I feel it's still worth making explicit)

This is necessary, because when a repo accumulates commits, it becomes a lot more efficient to store most of the objects as deltas instead of separate blobs. If Git didn't do this, it would have a lot of copies, and they would take a lot of space.

So the fundamental model of git is truly based on blobs in theory, but in practice many or most git commands will operate on packfiles, and if you look in your .git object store, most likely you will have a few big packfiles containing most objects, and then a much smaller collection of loose blobs.

All those diffs are what the "resolving deltas" progress indicator that people see when they do a big clone, fetch, or checkout is about =)


> In practice, git stores a lot more diffs (or rather, deltas) than it stores loose blobs.

The diffs it stores are not the diffs you see in git diff.

They're rolling checksum based chunks. The data that the delta is computed against is picked with a heuristic ("sort by name and date, try the top 10, and use the smallest result"). And, in practice, the heuristic diffs the older files against the newer ones, rather than diffing in chronological order, so that getting recent data doesn't involve a lot of delta application.

The git deltification is better thought of as a compression method than as diffing.
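
To illustrate the "compression, not diffing" framing, here is a toy delta encoder that expresses a target as copy-from-base and insert-literal instructions. This is only a sketch of the general idea: it uses Python's `difflib` to find matching byte ranges, which is an assumption for illustration and not the matching algorithm or binary encoding git actually uses:

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as ('copy', offset, length) / ('insert', data) steps."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: the bytes are simply never copied.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base plus the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n"
target = b"line one\nline 2\nline three\nline four\n"
delta = make_delta(base, target)
print(delta)
print(apply_delta(base, delta) == target)
```

Note that the instructions carry no notion of lines or hunks, and the base can be any object that happens to be similar, not necessarily the previous version of the same file.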


Packfiles and deltas are a storage and transfer optimization for blobs. Any access to them stores and yields blobs.

It is for all intents and purposes just an internal serialization format, akin to how a filesystem is just a serialization format that makes all your data one large stream. One generally talks about the provided interface (files for a filesystem, blobs for git) rather than where the bits actually go.

Compression algorithms are also to some extent diffs, as they serialize to a sequence of "repeat previous segment and add this new data" commands, but it is not useful to consider them as such.


Merges, rebases, cherry-picks, are all the same kind of thing. A merge is essentially a rebase that squashes all the commits being picked.


No, a merge works very differently.

A merge is a commit with two parent commits, pointing to a new tree that contains the blobs from both parents. It does not modify any blobs, nor does it modify the parent commits. The full history of all activity is retained.

A merge conflict is a case where both trees changed the same blob since their common ancestor. In this case, you have to make a new blob yourself (the "resolution") for use in the merge commit's tree, instead of using one of the parent blobs.
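
The blob-level rule described here can be sketched as a three-way comparison per path. This is a simplified model, assuming trees are flat dicts mapping a path to a blob id (the names below are made up); real git additionally attempts a line-level content merge on the paths this sketch reports as conflicts:

```python
def merge_trees(base: dict, ours: dict, theirs: dict):
    """Three-way merge of path -> blob id maps; returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (or both deleted the path)
            pick = o
        elif o == b:      # only their side changed it
            pick = t
        elif t == b:      # only our side changed it
            pick = o
        else:             # both changed it differently since the ancestor
            conflicts.append(path)
            continue
        if pick is not None:  # None models a deleted path
            merged[path] = pick
    return merged, conflicts


base = {"a.txt": "blob1", "b.txt": "blob2", "c.txt": "blob3"}
ours = {"a.txt": "blob1x", "b.txt": "blob2", "c.txt": "blob3o"}
theirs = {"a.txt": "blob1", "b.txt": "blob2t", "c.txt": "blob3t", "d.txt": "blob4"}

merged, conflicts = merge_trees(base, ours, theirs)
print(merged)
print(conflicts)
```

No blob is modified along the way: every entry in `merged` is an id that already existed in one of the parents, and only the conflicting paths need a newly written blob.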

Squash is "Remove all commits from C_newest to C_oldest, and create a new commit using C_newest tree". Rebases just run another git action for every commit in a sequence, e.g. cherry-pick.


> A merge is a commit with two parent commits, pointing to a new tree that contains the blobs from both parents.

The second parent is metadata. The way a merge works is essentially to compute the commits you need to cherry-pick, then cherry-pick them without committing, resolving conflicts in pretty much the same way git cherry-pick does, THEN commit with two parents.

> It does not modify any blobs, nor does it modify the parent commits. The full history of all activity is retained.

A new commit containing merged content is created, as well as a merge commit with the second parent that documents that a merge happened and what was merged.

> A merge conflict is a case where both trees changed the same blob since their common ancestor. In this case, you have to make a new blob yourself (the "resolution") for use in the merge commit's tree, instead of using one of the parent blobs.

git merge does the same thing for automatic (and manual) conflict resolution as git cherry-pick. So does git-rebase.

> Squash is "Remove all commits from C_newest to C_oldest, and create a new commit using C_newest tree". Rebases just run another git action for every commit in a sequence, e.g. cherry-pick.

Rebasing is constructing a set of operations:

  - construct a set of commits to pick as the
    commits between (the merge-base of HEAD
    and the selected commit) and the selected
    commit
  - git checkout the --onto HEAD
  - cherry-pick the selected commits

An interactive rebase lets you drop commits, add commits, edit, reword, or fixup/squash commits.

Squashing a commit is essentially doing `git cherry-pick --no-commit` of the to-be-squashed commit and then `git commit --amend` to replace the HEAD commit with a new commit that includes the changes staged by `git cherry-pick --no-commit`.

Yes, it really is this simple. I aver that it is easier to understand the above than to think of merging and rebasing and cherry-picking as fundamentally different operations.
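
One way to model this view is a toy where a cherry-pick is a three-way merge whose base is the picked commit's parent, and a rebase is a loop of cherry-picks. Everything here is a simplification for illustration: `Commit`, `three_way`, `cherry_pick`, and `rebase` are hypothetical names, trees are flat dicts of path to content, and conflicts simply raise instead of being resolved:

```python
from dataclasses import dataclass


@dataclass
class Commit:
    tree: dict           # path -> content snapshot
    parents: tuple = ()  # parent Commit objects


def three_way(base: dict, ours: dict, theirs: dict) -> dict:
    """Whole-file three-way merge; raises on a conflicting path."""
    result = {}
    for path in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t or t == b:
            pick = o
        elif o == b:
            pick = t
        else:
            raise ValueError(f"conflict in {path}")
        if pick is not None:
            result[path] = pick
    return result


def cherry_pick(head: Commit, commit: Commit) -> Commit:
    """Replay commit's change (relative to its first parent) on top of head."""
    base = commit.parents[0].tree if commit.parents else {}
    return Commit(three_way(base, head.tree, commit.tree), (head,))


def rebase(onto: Commit, commits: list) -> Commit:
    """Cherry-pick each commit in order, starting from onto."""
    head = onto
    for commit in commits:
        head = cherry_pick(head, commit)
    return head


a = Commit({"f": "1", "g": "1"})
b = Commit({"f": "1", "g": "2"}, (a,))            # main moved on
c = Commit({"f": "2", "g": "1"}, (a,))            # feature commit 1
d = Commit({"f": "2", "g": "1", "h": "1"}, (c,))  # feature commit 2

new_head = rebase(b, [c, d])
print(new_head.tree)
```

The rebased commits are new objects with new parents; `c` and `d` themselves are untouched.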


> The second parent is metadata. The way a merge works is essentially to compute the commits you need to cherry-pick, then cherry-pick them without committing, resolving conflicts in pretty much the same way git cherry-pick does, THEN commit with two parents.

Good luck explaining an N-way merge with this approach, such as the 66-way "cthulhu merge" that is 2cde51fbd0f3 in the linux tree.

All parents are metadata, they do not contribute to the content of the commit other than their "parent" line in the commit object after the merge finished.

> A new commit containing merged content is created, as well as a merge commit with the second parent that documents that a merge happened and what was merged.

A merge only produces one commit: The merge commit, pointing to the tree of the merged content. It is a completely normal commit, having multiple parents like any commit can.

The tree of the merge commit may contain new blobs not present in any of the parents if conflict resolution was required. Otherwise, the new tree is simply a combination of the parents' trees.
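
To show that a merge commit is an ordinary commit object with extra `parent` lines, here is a sketch that builds the object text and hashes it, assuming the SHA-1 object format. The ids, author string, and function name are placeholders, not real repository data:

```python
import hashlib


def commit_object(tree: str, parents: list, author: str, message: str):
    """Build a commit object body and return (object id, body bytes)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, two, or many
    lines += [f"author {author}", f"committer {author}", "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest(), body


# Placeholder ids and identity, purely illustrative.
tree_id = "a" * 40
parent_ids = ["b" * 40, "c" * 40, "d" * 40]
who = "Example Author <author@example.com> 1700000000 +0000"

oid, body = commit_object(tree_id, parent_ids, who, "Merge three branches")
print(oid)
print(body.decode("utf-8"))
```

The only content reference is the single `tree` line; the parents are recorded for history and do not by themselves determine what the tree contains.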

> Rebasing is constructing a set of operations: <snip>. An interactive rebase lets you drop commits, add commits, edit, reword, or fixup/squash commits.

Yup, that's what I wrote.

> Squashing a commit is essentially doing `git cherry-pick --no-commit` of the to-be-squashed commit and then `git commit --amend` to replace the HEAD commit with a new commit that includes the changes staged by `git cherry-pick --no-commit`.

I think most associate squashing with the act of reducing a foreign branch into a single new commit as a merge strategy (as opposed to fast-forward or merge).

What I described was squashing commits on the current branch, while you're describing squashing a single foreign commit into HEAD. Technically neither is what `git merge --squash` does, as that doesn't produce a commit at all.

> Yes, it really is this simple.

Well, I find your description complex (and having resulting inconsistencies) as it tries to describe plumbing in the terms of porcelain, which is backwards and honestly one of the main reasons I think people are confused about git.

But each to their own I guess.


> The tree of the merge commit may contain new blobs not present in any of the parents if conflict resolution was required. Otherwise, the new tree is simply a combination of the parents' trees.

Sure, but for me this is the common case.

> I think most associate squashing with the act of reducing a foreign branch into a single new commit as a merge strategy (as opposed to fast-forward or merge).

I don't. I git rebase -i often to squash commits.

> But each to their own I guess.

Whatever works. However, I find that when people focus on the semantics of merging, they then don't care to understand cherry-picking or rebasing, and they miss out on those very useful concepts. Whereas understanding what the process looks like helps one (me anyways) unify an understanding of all three concepts. I much prefer understanding one thing from which to derive three others than to understand those three things independently.


There's also a historical angle here that's important to inspect - Git was designed to specifically be content agnostic. There are some predecessors in the SCM space (like VSS) that are specifically language aware and allow the checking out of line ranges (pinning them so that no one else will make a conflicting change specifically) and even entire functions - these systems can cause a lot of grief while failing to protect the logic they're specifically trying to protect. As the warts on SVN got more and more visible I think the general assumption was that the replacement SCM would come out of this code aware space - but it didn't and in retrospect we all dodged a huge bullet when that happened.

I absolutely adore tooling around git that makes diffs more visible - one thing I absolutely gush over is anything that can detect and highlight function reordering... however, the core process of merging and rebasing and all that jazz - I don't think we're going to find anything automated that I'll ever trust when I'm not working on a ridiculously clean codebase - minor changes can have echo effects and when two people are coding in the same general area they need to be aware of what the other person is trying to do.


I dunno, I feel like you're focusing on a detail that's not particularly relevant. The author's main thrust is precisely what you described about parsing changed files as ASTs.


It isn't relevant to the author's vision of content-aware diffing, but it is relevant to the author's complaints about how Git's (alleged) text-based-ness makes Git awkward to use with Jupyter notebooks. Has the author tried searching the web for "git diff jupyter"?


You can already choose different `diff` programs to use for particular filetypes. E.g., nbdime for Jupyter notebooks:

https://nbdime.readthedocs.io/en/latest/vcs.html#git-integra...
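
For a sense of what a notebook-aware converter does before diffing, here is a minimal sketch (not nbdime's implementation) that reduces a notebook's JSON to its cell sources, dropping outputs and execution counts. It assumes the usual notebook layout of a top-level `cells` list with `cell_type` and `source` fields; the function name is hypothetical. Something like this could be wired in as a text conversion step for `*.ipynb` files:

```python
import json


def notebook_to_text(raw: str) -> str:
    """Render a notebook as plain text containing only cell types and sources."""
    notebook = json.loads(raw)
    chunks = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        kind = cell.get("cell_type", "unknown")
        chunks.append(f"# cell {index} [{kind}]\n{source}\n")
    return "\n".join(chunks)


before = json.dumps({"cells": [
    {"cell_type": "code", "source": ["x = 1\n", "x + 1"],
     "execution_count": 1, "outputs": [{"text": "2"}]},
]})
after_rerun = json.dumps({"cells": [
    {"cell_type": "code", "source": ["x = 1\n", "x + 1"],
     "execution_count": 7, "outputs": [{"text": "2"}]},
]})

print(notebook_to_text(before))
print(notebook_to_text(before) == notebook_to_text(after_rerun))
```

Re-running a notebook changes the stored JSON but not the converted text, so a diff over the converted form only shows edits to the cells themselves.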


Pretty sure OP does understand, and is proposing what you deduced.

Incorporate some semantic understanding of the version controlled data into the VCS. Currently this work is subcontracted to humans.


Maybe I'm misunderstanding. It's just lines like this:

> The text-orientated design of git reflects...

> The current version of git is also able to find differences in binary files.

> if we were storing information as ASTs, rather than lines of text

These all, to me, show a gap in the author's understanding of how git works. And that's okay; git is often easier to use than it is to understand.

But if they had a better understanding, they could make their point far better. And without understanding, they won't be able to implement this idea.


The author is likely using "git" to mean "the entire typical git user experience that git users spend time looking at".

And, from that perspective, Git-the-UX definitely does work on line-oriented files.


The git extension on VSCode is already pretty good at doing diffs on Jupyter notebooks.

I distinctly remember this not being a core feature of stock git and needing Jupytext to enable version control on notebooks. So, I feel like this sort of language specific stuff is already happening, but not in any unified product.


I'm in the process of building a programming language for UI designers, and realized that diffing the AST (or some other kind of object notation) would be far more useful and understandable. I'll probably be digging into this exact problem within the next year or so.




