Random collisions in 160-bit space are incredibly unlikely. This is about intentional collisions: it's entirely feasible for someone with significant compute power to create a git commit that has the exact same hash as another git commit. That could allow someone to silently modify a git commit history to, e.g., inject malware or a known "bug" into a piece of software. The modified repository would be indistinguishable from the original if you're only checking git hashes.
Git uses SHA-1 for unique identifiers, which is technically okay as long as those identifiers are not relied on for security. If git were designed today it would probably use SHA-2 or SHA-3, but it's probably not going to change due to the massive install base.
Edit: does anyone know if git's PGP signing feature creates a larger hash of the data in the repo? If not, maybe git should add a feature where signing is done after computing a larger hash, such as SHA-512, over all commits since the previous signature.
The defence used by GitHub specifically defends against these intentional collisions, not some mirage of random collisions.
Basically you collide a hash like SHA-1 or MD5 by getting it into a state where transitions don't twiddle as many bits, and then smashing the remaining bits by brute force trial. But, such states are weird so from inside the hash algorithm you can notice "Huh, this is that weird state I care about" and flag that at a cost of making the algorithm a little slower. The tweaked SHA1 code is publicly available.
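A rough Python sketch of that idea (the marker string and predicate here are entirely made up for illustration; the real sha1collisiondetection code inspects the compression function's internal state against known disturbance vectors, not the input bytes):

```python
import hashlib

def looks_like_attack(block: bytes) -> bool:
    # Stand-in for the real check: the sha1collisiondetection library
    # recognizes the "weird states" inside SHA-1's compression function.
    # Here a made-up byte marker plays that role.
    return block.startswith(b"WEIRD-STATE")

def hardened_sha1(data: bytes) -> tuple[str, bool]:
    # Hash as normal, but flag any 64-byte input block that looks like
    # it is steering the hash toward a collision-friendly state.
    flagged = any(
        looks_like_attack(data[i:i + 64]) for i in range(0, len(data), 64)
    )
    return hashlib.sha1(data).hexdigest(), flagged
```

The output hash is unchanged for honest inputs; the extra boolean is the "huh, this is that weird state" flag, at the cost of a little extra work per block.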
If you're thinking "Oh! I should rip out our safe SHA-256 code and use this unsafe but retroactively safer SHA-1": No. Don't do that. SHA-256 is safer and faster. This is an emergency patch for people for whom apparently 20 years' notice wasn't enough warning.
In theory the known way to do this isn't the only way, but we have reassuring evidence from MD5: independent actors (probably the NSA), who have every reason to choose a different way to attack the hash so as to avoid detection, trigger the same weird states, even though they're spending eye-watering sums to break hashes themselves rather than copy-pasting a result from a published paper.
So, if I understand correctly: the patched SHA-1 code generates the same hash, but it has checks on the internal state so that it will flag inputs that are likely to be part of an intentional collision?
It is not yet possible "to create a git commit that has the exact same hash as another git commit" in the sense that, given a commit someone else has already made, you could make another commit with the same hash. (That would be a second preimage, which is much harder than a collision.)
What is possible now is something much easier: if you have enough money and time, you can create 2 commits with the same hash. They start with different parts that may be chosen arbitrarily, but then have to include parts that are computed to make the hashes match; those computed parts will be gibberish, which might be disguised in some constants or a binary blob, if possible.
Then you can commit one of them and presumably you can later substitute the initial commit with the other one without anybody being able to detect the substitution.
It doesn't take any money or time. Google's break of SHA-1 was fully reusable. So long as committing a PDF to the repo counts, there's a script that will trivially concatenate two PDFs such that each renders to its original (different) content, but both have the same SHA-1 hash. Put one in a repo, `git add foo.pdf`, and you're done.
Nope. Google's break of SHA-1 was reusable in the sense that you can add an arbitrary (identical) suffix to both PDFs and keep the collision. But Git does not use raw SHA-1 hashes, it adds a prefix before computing object hashes. Therefore, Google's break of SHA-1 cannot be reused to break Git.
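To make the prefix concrete, here's a minimal sketch of how Git hashes a blob (the header format is the real one Git uses; the helper name is mine):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    # Git hashes "blob <size>\0" + content, not the raw file bytes,
    # so two files whose raw SHA-1 digests collide do not automatically
    # collide as Git blob objects.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches `git hash-object` for the same content:
# git_blob_sha1(b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```

Because the header changes the hash's starting state, the SHAttered collision (which only survives appending an identical suffix) doesn't carry over to Git object IDs.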
In short, it's a method of storage where an object's identity is derived from the object's content (usually by hashing it). So the assumption is: same hash => same content => same object.
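A toy sketch of that idea (names are mine, not from any real implementation):

```python
import hashlib

class ContentStore:
    """Toy content-addressable store: an object's key is the hash of its content."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        # Core assumption of content addressing: same hash => same content,
        # so storing the same content twice yields the same key.
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]
```

The whole collision discussion is about what happens when that "same hash => same content" assumption fails.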
I would think it's not that simple, because `git push --force` doesn't do anything if it thinks the histories are the same, and in this case we've created a history that appears to be the same. You'd likely need a custom git client (which isn't a problem), but I don't know enough about the internals of a git push to know whether the server would even accept objects whose hashes match objects it already has; it may just go "I don't need that" and ignore them, because what's the point in doing anything with them? Presumably it would depend on the exact server implementation whether you could get it to replace an existing object with a "new" one that has the same hash, but frankly I think that's unlikely, because it would be pointless work from the server's point of view. If it does happen, I'm not sure what auditing you would actually see; webhooks and such might not be triggered because the history didn't actually "change".
What you could do, however, is host it yourself somewhere else, say on a fork. Or if you have access to the actual repository hosting the original version, you could just manually replace it yourself. Git clients aren't going to automatically pull the "new" version, though, so you'll have some combination of people with the old and people with the new, and it gets a little messier from there.
If you can force push, why would you then not push a different commit before pushing your updated commit with the same, original hash, or does that also not work?
I wouldn't expect that to work reliably, because git doesn't actively remove unused objects from the store; that's why `git reflog` can find things that aren't actually referenced anywhere anymore. `git gc` is needed to make them actually go away, and whether and how often that happens is up to the server. GitHub, for example, practically never does it, and even if it did, it would be hard to reliably ensure no references to the relevant object still exist anywhere on the server. For example, you would have to force push every branch on the server that references the object, and if the git server creates branches for internal use you might not be able to touch those or convince the server the object is unused. And even if all the references are gone, if the object is never actually deleted from the server, it should just get reused when you try to `git push`.
As of January 2021, SHA256 repositories are supported, but experimental. They can be created with `git init --object-format sha256`. If I understand correctly, they don't mix at all with SHA1 repositories (i.e. you can't pull/push from/to between SHA1 and SHA256 repos).
Here is discussion [1] on the issue from the Git mailing list at the time, with some useful context WRT git. Git 2.29 had experimental support for SHA-256, but that was a year ago and I'm not sure of its current status.
I was surprised that no one suggested truncating SHA-256 to 160 bits (same as for SHA2-256/224, or SHA2-512/256). The attacks on SHA-1 are not directly based on the length of the hash; they are based on weaknesses in the algorithm.
Even attacking SHA2-256/128 would be quite difficult as I understand it, even though it's the same length as MD5.
Truncated hashes also of course have the great property that they mitigate length extension in Merkle-Damgård constructions.
> Truncated hashes also of course have the great property that they mitigate length extension in Merkle-Damgård constructions.
To be fair, this is totally irrelevant to git, since the attacker knows the whole message and can just recompute the extra bits themselves. That said:
> I was surprised that no one suggested truncating SHA-256 to 160 bits (same as for SHA2-256/224, or SHA2-512/256). The attacks on SHA-1 are not directly based on the length of the hash; they are based on weaknesses in the algorithm.
Very seconded. You could even shove the extra 96 bits into an optional metadata field and have new versions of git throw up a giant air-raid-siren-level error if they don't match (since that will never happen by accident[0]), and still have the full 256-bit-hash worth of security for most purposes. Git already allows (arguably encourages) people to truncate hashes to 28 bits or so at the UI level, so there's precedent for that already.
0: You do not have anywhere near 2^80 commits in the world, much less in the same repo.
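A minimal sketch of the scheme described above (all names are hypothetical; this is just the split-and-verify idea, not anything git actually implements):

```python
import hashlib

def split_object_id(data: bytes) -> tuple[str, str]:
    # Use the first 160 bits of SHA-256 as the SHA-1-sized object ID,
    # and carry the remaining 96 bits in an optional metadata field.
    digest = hashlib.sha256(data).hexdigest()  # 64 hex chars = 256 bits
    return digest[:40], digest[40:]            # 160-bit ID, 96-bit check

def verify(data: bytes, object_id: str, check: str) -> bool:
    # A new client recomputes both halves; a mismatch in the check bits
    # "will never happen by accident" on honest data.
    digest = hashlib.sha256(data).hexdigest()
    return digest[:40] == object_id and digest[40:] == check
```

Old clients would just ignore the check field; new clients get close to the full 256-bit security.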
Yep. GitHub is saying they would block an object that looked like it was crafted to produce a collision using SHAttered, and that they haven't seen one, intentionally crafted or otherwise, in the wild.
> This could allow someone to silently modify a git commit history to e.g. inject malware or a known "bug" into a piece of software.
You need a collision. You also need it to be syntactically correct. You need it to not raise any red flags if you are contributing a patch. And ultimately you need it to do what you want.
All you need is one variable sufficiently large that you can change until you find a collision, such as a comment (which about all languages allow) or a meaningless number.
You could even vary whitespace until it fits, like spaces at the end of lines.
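As a toy illustration of that degree of freedom (matching only a short hash prefix by brute force; an actual 160-bit SHA-1 collision needs cryptanalytic shortcuts and is hopeless by trial alone):

```python
import hashlib

def vary_until_prefix(source: str, target_prefix: str) -> str:
    # Vary a meaningless trailing comment until the hash matches a
    # short target prefix, demonstrating that a comment or constant
    # gives an attacker a free variable to tweak.
    n = 0
    while True:
        candidate = source + f"  # padding {n}\n"
        digest = hashlib.sha1(candidate.encode()).hexdigest()
        if digest.startswith(target_prefix):
            return candidate
        n += 1
```

A two-hex-character prefix match takes a few hundred tries on average; every extra hex character multiplies the work by 16.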
Commits aren't patches. They contain the whole tree. Retroactively changing a commit can't possibly introduce conflicts with other commits on top of it, the worst it can do is introduce big funny-looking diffs.
It's true that semantically git commits store the whole tree, but doing that naively would be inefficient. Instead, packfiles store some objects as deltas, which could either result in inconsistencies or noticeable knock-on changes if the original object contents are changed.
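As a toy illustration of the copy/insert structure of deltas (heavily simplified compared to Git's real binary delta format):

```python
def apply_delta(base: bytes, ops) -> bytes:
    # A delta is a list of "copy this range from the base object" and
    # "insert these literal bytes" instructions, which is roughly how
    # Git's packfile deltas work conceptually.
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:  # ("insert", data)
            out += op[1]
    return bytes(out)

base = b"line1\nline2\nline3\n"
delta = [("copy", 0, 12), ("insert", b"changed\n")]
new = apply_delta(base, delta)
```

Because a delta reaches into the base object's bytes, changing the base's contents silently changes everything reconstructed from it.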
While that's true, I'd be very surprised if git delta-compressed the commit objects themselves. Changing a commit to point at a different tree wouldn't impact the delta-compressed packings of any file blobs; it would just change which tree (and thus which files) the commit points to.
For example, suppose you started with a commit graph that looked like this (each commit references its parent commit and its tree, and each tree references a file blob):

    C1 <-- C2 <-- C3
    |      |      |
    T1     T2     T3
    |      |      |
    F1     F2     F3

Where C1, C2 and C3 are commits; T1, T2 and T3 are the trees they reference; and F1, F2 and F3 are three versions of a file blob stored delta-compressed in your packfile. Then if you had a malicious version of C2 with the same hash, you could replace C2 with a new commit C2' pointing at a new tree T2' with a new file object F2', and nothing would break. The resulting commit graph would look like this, and F1, F2 and F3 would all still be in your packfile, delta-compressed and accessible, just with nothing referencing T2/F2:

    C1 <-- C2' <-- C3
    |      |       |
    T1     T2'     T3
    |      |       |
    F1     F2'     F3
Regardless, this is all moot to some extent. The attack almost everyone talks about is this: if you control a central git repository (for example, a mirror of an open source project), you can give two different versions of that repository to different people without them being able to tell, even if they check PGP signatures or reference specific git hashes. For example, you could serve the non-malicious files to human developers, but when a user agent that looks like a CI/CD pipeline (Jenkins, the Ubuntu/Debian/RedHat packager's build machine, or something similar) clones the repository to build a specific hash requested by the user, give it a malicious version of the source tree that builds a backdoor into the binaries it creates. In this sort of attack you never have to "change" a git object on someone's machine, which is something the git protocol isn't designed to do because it never happens naturally.
Indeed. I'd add that this is, in my experience, one of the biggest sources of misunderstanding for people new to git. It isn't helped by the fact that a lot of git introductions (well-meaningly) emphasize diffs between commits.
Darcs (http://darcs.net/) is an example of a truly patch-centric DVCS. While I think git is great and that its ubiquitousness has made the world better, I'm always a bit sad when reminded of what could have been with Darcs (for all its problems).
Well, the contents of a commit are a patch plus metadata. Commits point to a parent commit, and layer themselves in the tree.
The problem would be if a clone doesn't fetch the new version of the patch and generates a new commit that would conflict with the modified commit. You're changing the base all the future diffs are built on. It might just jumble the source and essentially corrupt the file, but I'm not sure.
The contents of a commit are not a patch; it's the whole tree. The git UI presents it as a patch, but that's generated at runtime by the `git diff` command.
It does internally use delta compression to save storage space, but it's not necessarily a straight delta between a commit and its direct ancestor (and that's just an internal optimisation).
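For illustration, here's roughly what a commit object's content looks like and how its ID is computed (the tree hash is Git's well-known empty-tree hash; the author/committer fields are made up). Note there's no diff anywhere in it:

```python
import hashlib

# A commit object stores a tree hash plus metadata, never a patch.
commit_body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1234567890 +0000\n"
    b"committer A U Thor <author@example.com> 1234567890 +0000\n"
    b"\n"
    b"initial commit\n"
)
# Like blobs, commits are hashed with a "<type> <size>\0" header.
header = b"commit %d\x00" % len(commit_body)
commit_id = hashlib.sha1(header + commit_body).hexdigest()
```

Non-root commits additionally carry one `parent <hash>` line per parent; the diff you see in `git show` is computed on the fly against that parent's tree.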
From a practical point of view, how would injecting malware happen? If you're trying to insert a malicious diff somewhere deep in the git history, you would need to recompute all the other commits after the injected commit - which would most certainly change their commit ids too if they are touching the same file. When other commit ids change, the malicious change becomes detectable.
There's also the case for auditing: force pushing into an existing repo triggers an event in GitHub and is logged. While the logging event can be missed, it leaves a paper trail.
With things like reproducible builds, this also becomes harder. Distributing (through a fork, or by putting it up on a website like mytotallylegitgitrepos.com) source code that builds into a binary whose hash doesn't match upstream is suspicious.
Presumably the attacker would modify the most recent commit which edits the file that they are targeting. It is true that the attack becomes more difficult if you try to target an older commit.
Auditing helps if they try to force push the original repo, but doesn't protect vs someone redistributing malicious clones of the repo.
Reproducible builds do help, but only for projects that can take advantage of them...
Every commit references every file. If you change the content of an old commit you would only affect people who check out the old commit. So this is utterly pointless and not what someone would do.
Instead what you would do is attempt to make a file-object that has a certain SHA1 hash identifying it, and a colliding file-object that has the same SHA1 hash. Then you are free to give people who clone the repository different file contents depending on when/who/how someone requests it (if the file content is hosted on github, how to change the file object identified by a given SHA1 hash is an additional hurdle since it's assumed to be immutable and indefinitely cacheable; if you control the host yourself you can just change it whenever you like).
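A toy sketch of that serving logic (the object ID is fake, and in reality the two payloads would have to be a genuine SHA-1 collision; here they obviously are not):

```python
# A malicious host maps one object ID to two payloads and picks one
# based on who is asking.
OBJECT_ID = "deadbeefdeadbeefdeadbeefdeadbeefdeadbeef"

PAYLOADS = {
    "human": b"print('hello world')\n",
    "build_machine": b"print('hello world')  # plus a backdoor\n",
}

def serve_object(object_id: str, user_agent: str) -> bytes:
    # Serve the malicious variant only to things that look like CI/CD
    # pipelines; human developers reviewing the code see the clean one.
    if object_id != OBJECT_ID:
        raise KeyError(object_id)
    if "Jenkins" in user_agent or "CI" in user_agent:
        return PAYLOADS["build_machine"]
    return PAYLOADS["human"]
```

Since both variants share one ID, nothing downstream that trusts the hash can tell which one it received.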
> A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.
Just to nitpick, I don't think that formula is valid. We're primarily interested in "unrelated" wolf attacks, but it counts the total fatalities, not the total number of fatal incidents. If we count each fatal attack as only one incident, regardless of the casualties, we get 2^-258 instead.
But of course we also need to take into account where the 6-member team lives. If they all live in West Bengal, India, the consideration is much different than if our developers live in Atlanta. Atlanta doesn't have any wild wolves. There is a Wolf's guenon in the zoo, but that probably doesn't count as a risk because they mostly eat small animals and also are monkeys.
This is why I always get mad when people say something like "you are more likely to be struck by lightning than eaten by a shark!"... well, that REALLY depends on where you are.
There are only about 10 fatal shark attacks per year. What's your calculation for the sharkiest area to live? It has to be something like 100+ times sharkier than average for your claim to be true. And keep in mind that half the US population can easily day trip to the ocean.
Edit: Actually, that's using a number of 2000 lightning fatalities which might be 10x too low. And lightning injuries are another 10x higher than that. So you'd need somewhere that shark attacks are a thousand or ten thousand times more likely than average. That's also without interpreting "eaten" literally...
You focused on the shark side and not on the lightning side. You will never get struck by lightning in a city, so there the probability is 0. Similarly if you go out swimming in the ocean on a sunny day the probability of getting struck by lightning is also 0, but the probability of getting eaten by a shark is not 0.
You're going too granular. This is about populations that live in certain areas, not specific people on specific days; "depends on where you are" should be read as an over-time risk ratio. The goal isn't to find some dude with weird habits.
And while a population that mostly stays inside a city has a reduced lightning chance, it also has a reduced shark chance. I don't think that's anywhere near the point of equalizing the rate.
If you have an idea for a factor that dramatically reduces lightning risk but doesn't reduce shark risk much, I'm interested.
Like I said in my other comment, it's not about finding one specific guy. Picking a place where people live is a reasonable starting point for making a rebuttal to a general statement like that. I don't think pointing at Fisherman Sam is.
> And if I live in a place where there is very rarely lightning, that chance gets really low as well.
How low can that number go? The numbers I picked made consideration for some amount of variation in lightning. Does lightning have a huge variance?
It's important to have real numbers when you're making the claim that there are places where the lightning:shark ratio is multiple orders of magnitude lower than the average. That's not a claim you can justify by merely pointing out that the risks will vary by location.
That's a good point! I didn't think about either of those issues. The second one raises the average team's chances of such a spate of unrelated attacks significantly, if anything related to this formula could be said to be significant, because the increase in risk to teams in West Bengal and Ukraine is enormously larger than the decrease in risk to teams in Atlanta.
This is problematic because it must be 2774 years ago, when wolf attacks were substantially more common. I'm also not thrilled that a good chunk of my dev team lives near the Lupercal. The one good piece of news is that at least one of my headcount is fated to die by the hand of man, which might bring the odds of simultaneous wolf attack deaths down to zero unless he quits beforehand.
> Aside from being bullshit, it's also irrelevant, since we're discussing a collision being generated on purpose, not by accident.
I think you just pointed out the error in your own reasoning. This is defending against a deliberate attack. Therefore, your proper odds would be that your programming team is deliberately set upon by 6 different wolves. So, have they offended people who have access to 6 wolves, and the time and inclination to train them (or hire others to) in an effort to pull off a murder spree?
Edit: Actually, the hash attack already assumes motivation and skill. So, I don't know what the odds would have to be computed. That at least one programmer on your team could fight off a trained attack wolf (to whatever level of "training" is the current state of the art for attack wolves)?
This is stupid. The probability of any attack happening at random is close to zero. E.g. the probability of a buffer overflowing just right to give you a shell through random chance is less than being eaten by wolves.
The question is about the risk of someone intentionally performing the attack, not the probability it will accidentally happen at random.
https://github.blog/2017-03-20-sha-1-collision-detection-on-...