Obviously, any actual crypto-experts can feel to correct me if I got history and/or understanding incorrect here.
-----
Addenendum to my previous post.
1. Confusion -- Bytes should turn into other bytes in random-looking ways. For example, the byte 0x25 may turn into 0x88. Aka: S-boxes in 90s-era ciphers.
2. Diffusion -- Bit-changes should "spread" to as many bits as possible.
3. Invertible -- Invertible operations minimizes the loss of information. Encryption/Decryption must be invertible by definition, but even Hash-functions should be largely built out of invertible operations. Try to make confusion and/or diffusion steps invertible.
4. ADD / XOR / Shift / Rotate -- These operations are the more popular way to make Invertible Confusion/Diffusion functions today.
5. SBox + Galois Fields -- For AES (a 90s-era algorithm), SBox was the source of confusion, and Galois Field arithmetic was the source of Diffusion. I could explain why but that gets more complicated.
5. Testing -- Test your functions against linear cryptography (how is the input related to the output?) and differential cryptography (how is each input bit related to each output bit on a bit-by-bit basis?)
------
Obviously, hash functions (like SHA256) must be non-invertible by the end of it all. But you want to carefully think about where the source of non-invertibility comes from, and to minimize the loss of entropy/information at any particular step.
With these principles, its not very hard to make your own hash function. I'd suggest studying Bob Jenkin's "JOAAT" hash, just-one-at-a-time. Its a non-crypto hash, but it is probably one of the simplest hashes that uses the above principles: https://en.wikipedia.org/wiki/Jenkins_hash_function
> even Hash-functions should be largely built out of invertible operations
Why is that? I don’t understand why you should minimize the loss of entropy at any particular step. Is it to help resist collisions?
> I could explain why but that gets more complicated.
If you’d care to go into more detail, I’d love to hear it. It was my understanding that the AES S-box was the result of some nonlinear transformation of the input bytes.
> Why is that? I don’t understand why you should minimize the loss of entropy at any particular step. Is it to help resist collisions?
SHA 256 is built out of 64 rounds of an internal mixing function.
That is, SHA256 is basically a glorified mix(mix(mix(mix...(mix(internal_state))))))))))))))) function, with 64x mixes applied to the internal data per input.
Now consider how "mix" should be designed.
1. If "mix" is invertible, it means that every possible input "pigeonholes" to exactly one possible output. Mathematically, we'd call invertible functions "one-to-one" and "onto".
2. If "mix" is not invertible, then the above condition does not hold.
-------------
So lets look at the simplest non-invertible function for 3-bit numbers as an example. And see what happens when we use it as a "mixing" function.
Notice that 110 and 111 both "map" to 111. So what happens when we run mix(mix(mix(mix...(data)))) ??
Eventually, after about 8 rounds, all your data turns into "111" eventually. You're "losing" information. It means that everyone can predict that your hash probably will be 111 eventually.
000 was "unchosen" in previous function, and now we have one-to-one and onto . invertibility. As such mix(mix(mix(mix(...(data)))) will 100% depend on the input data.
Of course, this mixing function is simply "(data+1) modulo 8", which is insufficient for mixing purposes. (The "confusion" is pretty bad and very predictable, but you can see how "add +1" is actually kind of good at diffusion when "all the carries" line up... such as 011 -> 100 is changing a lot of bits from one mix). But you can imagine that even in this simplest example, the invertible function leads to a less-predictable internal state than the non-invertible function.
---------
Functions should also be "nonlinear". The above x+1 mixing function is bad because its linear (5x mixing functions can be simplified into x+5). It turns out that combinations of XOR, bitshifts, and ADDs could be non-linear, but you need to run a bunch of tests over the inputs/outputs to be sure that your particular mixing function is sufficient.
What differentiates a non-crypto hash from a crypto hash. Is there a fundamental difference between the two which prevents one from being used for cryptographic purposes?
A non-crypto hash is run enough to confuse "simple" statistical tests.
A crypto-hash is run enough times to confuse "difficult" statistical tests from a dedicated adversary.
For example, the JOAAT hash is a simple add / shift / XOR hash run 1x round per byte. A singular round is all you need to confuse the "simple" statistical tests such as birthday attacks, dice rolls, etc. etc.
SHA-256 is also a simple add / shift / XOR hash, albeit using 256-bits at a time instead of 8-bits at a time. It is a 64-round hash function. That is: it is round(round(round(round(data)))) 64-times deep.
--------
If you wanted to turn JOAAT non-crypto hash into a crypto-hash, a good first step would be to run JOAAT(JOAAT(JOAAT...() maybe 64x or 128x over the data.
Obviously, you'd need to run cryptoanalysis over the end result, and also test for non-linearity and other such properties. But "large number of rounds" is probably the most obvious difference between non-crypto and crypto hashes.
---------
In the 90s, functions like AES were designed with "maximum confusion / diffusion per round". In just 10 rounds, the bits are mixed up enough that it defeats linear and differential cryptography.
I don't know which cipher "changed the meta", but maybe it was the XTEA cipher? Since XTEA was so simple, they knew it would need many-many rounds to defeat linear and differential crypto. So they just did 64-rounds and called it the end of it.
----------
So you can see, a "crypto" hash is effectively a regular old hash done an absurd number of times to defeat the most incredible statistical-tests known to man... done under the assumption that the "opponent" is performing the most difficult statistical tests.
Yes! We’d like our cryptographic hash functions to be collision resistant and preimage resistant.
That is, we’d like it to be hard to generate 2 different messages m1 and m2 where the hash of m1 is equal to the hash of m2, and we’d also like for it to be hard to compute any function of the message m(except the hash of m) if you’re given only the hash of m.
Non cryptographic hash functions don’t require these properties, and in fact some hashing algorithms used for data mining are designed to, for example, map near inputs to near outputs.
https://news.ycombinator.com/item?id=30248439
Obviously, any actual crypto-experts can feel to correct me if I got history and/or understanding incorrect here.
-----
Addenendum to my previous post.
1. Confusion -- Bytes should turn into other bytes in random-looking ways. For example, the byte 0x25 may turn into 0x88. Aka: S-boxes in 90s-era ciphers.
2. Diffusion -- Bit-changes should "spread" to as many bits as possible.
3. Invertible -- Invertible operations minimizes the loss of information. Encryption/Decryption must be invertible by definition, but even Hash-functions should be largely built out of invertible operations. Try to make confusion and/or diffusion steps invertible.
4. ADD / XOR / Shift / Rotate -- These operations are the more popular way to make Invertible Confusion/Diffusion functions today.
5. SBox + Galois Fields -- For AES (a 90s-era algorithm), SBox was the source of confusion, and Galois Field arithmetic was the source of Diffusion. I could explain why but that gets more complicated.
5. Testing -- Test your functions against linear cryptography (how is the input related to the output?) and differential cryptography (how is each input bit related to each output bit on a bit-by-bit basis?)
------
Obviously, hash functions (like SHA256) must be non-invertible by the end of it all. But you want to carefully think about where the source of non-invertibility comes from, and to minimize the loss of entropy/information at any particular step.
With these principles, its not very hard to make your own hash function. I'd suggest studying Bob Jenkin's "JOAAT" hash, just-one-at-a-time. Its a non-crypto hash, but it is probably one of the simplest hashes that uses the above principles: https://en.wikipedia.org/wiki/Jenkins_hash_function