This week Google published a SHA-1 collision
There’s a lot of confusion about the implications of this. A lot of this is due to differences of opinion on what exactly constitutes a “new” collision. I tweeted about this
. The webpage for the attack itself is misleading, saying that the answer to “Who is capable of mounting this attack?” is people with Google-esque resources. This depends on what exactly you mean by “this attack”.
So I’m seeing a lot of “oh well just another anti-milestone for SHA, doesn’t affect anyone since its still quite expensive to exploit” reactions, as well as the opposite “aaaaa everything is on fire” reaction. Both are wrong. It has practical implications for you even if you are certain that you won’t attract the ire of an entity with a lot of computational power. None of these implications, however, are likely to be disastrous.
TLDR: Now anyone
, without needing Google-esque resources, can generate two colliding PDFs with arbitrary visual content in each.
(In fact, there’s already a PDF collision-generator
up where you can upload two images and get a PDF with collisions in it)
Okay, back up a bit. What’s a hash? What’s SHA-1?
I explained this a bit in my older post about zero-knowledge-proofs
In essence, a hash function takes some data (usually of arbitrary size), and produces a value called a hash
(usually of fixed size). Thd function has some additional properties:
- In almost all cases, a small perturbation in the input will lead to a large perturbation in the hash
- Given an input and its hash, it is computationally hard to find an alternate input producing the same hash
- It’s also hard to just find two inputs that has to the same value, though this is usually easier than the previous one
when two inputs hash to the same value, this is called a collision. As mentioned, is easier to find a
collision, over finding a colliding alternate input for a known input.
SHA-1 is one such hash function. It’s been known for a while that it’s insecure, and the industry has largely moved off of it, but it’s still used,
What did the researchers do?
They found a hash collision for SHA-1. In essence, they found two strings,
SHA1(A) == SHA1(B)
, given the way SHA-1 works, this means that you can generate infinitely many other such pairs of strings. And given the nature of the exact
they created, it is possible to use this to create arbitrary colliding PDFs.
Basically, SHA-1 (and many other hash functions), operate on “blocks”. These are fixed-size chunks of data, where the size is a property of the hash function. For SHA1 this is 512 bits.
The function starts off with an “initial” built-in hash. It takes the first block of your data and this hash, and does some computation with the two to produce a new hash, which is its state after the first block.
It will then take this hash and the second block, and run the same computations to produce a newer hash, which is its state after the second block. This is repeated till all blocks have been processed, and the final state is the result of the function.
There’s an important thing to notice here. At each block, the only inputs are the block itself and the hash of the string up to that block.
This means, if
are of a size that is a multiple of the block size, and
SHA1(A) == SHA1(B)
SHA1(A + C) == SHA1(B + C)
. This is because, when the hash function reaches
, the state will be the same due to the hash collision, and after this point the next input blocks are identical in both cases, so the final hash will be the same.
Now, while you might consider
to be the “same collision” as
, the implications of this are different than just “there is now one known pair of inputs that collide”, since everyone now has the ability to generate new colliding inputs by appending an arbitrary string to
Of course, these new collisions have the restriction that the strings will always start with
and the suffixes will be identical. If you want to break this restriction, you will have to devote expensive resources to finding a new collision, like Google did.
How does this let us generate arbitrary colliding PDFs?
So this exploit actually uses features of the JPEG format to work. It was done in a PDF format since JPEGs often get compressed when sent around the Internet. However, since both A and B start a partial PDF document, they can only be used to generate colliding PDFs, not JPEGs.
I’m going to first sketch out a simplified example of what this is doing, using a hypothetical pseudocode-y file format. The researchers found a collision between the strings: