
cryptosym's Issues

Add relevant RVs from SHA-256

I think there are some important random variables in the SHA-256 internals that are not being captured, which is holding back LBP from converging. Look at the algorithm more closely.

Handling AND factors whose inputs are the same

Not working:

  • add
  • add const
  • lossy pseudo hash
  • sha256

Working:

  • non-lossy pseudo hash
  • xor const
  • shift
  • invert
  • and const

What do the ones that don't work have in common? The AND gate inputs are the same random variable.

Fix:
It's a hack, but if the same RV feeds both inputs of an AND gate, create a child RV from the original RV and feed the original and the child into the AND gate. However, this introduces a small loop in the undirected graph, which may cause issues with LBP convergence.
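A minimal sketch of this workaround, assuming RVs are nodes in a factor-graph object; `make_equal_child` and `add_and_factor` are illustrative names, not the project's actual API:

```python
# Hypothetical sketch: duplicate an RV before wiring an AND gate whose two
# inputs would otherwise be the same node. The child is tied to its parent by
# an equality factor, which introduces the small undirected loop noted above.
def add_and(graph, a, b, out):
    if a is b:
        b = graph.make_equal_child(a)  # child RV constrained equal to `a`
    graph.add_and_factor(a, b, out)    # the AND factor now sees two distinct nodes
```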

Good results with only 2 iterations?

With the final 32 of the 512 input bits unknown, LBP correctly solves SHA-256 at difficulty level 15 with only 2 LBP iterations.

But the Bayesian network is NOT acyclic (not a polytree), since we need to consider cycles in the undirected version of the BN. If it were acyclic, 2 iterations would be enough and everything would converge to the true marginals.

If bits close to the hash are predicted very accurately, you could potentially "freeze" them as observed variables, then re-run BP, and repeat. But no: if bits near the hash could take either 0 or 1 under their constraints (children bits), you would need to track both hypotheses upstream.

GTSAM

Could GTSAM with DCS possibly be used to identify the graph topology and perform inference?

Log-domain LBP

Try to derive LBP in the log domain on your own, because messages often blow up to inf or underflow to values like 1e-70.
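A minimal sketch of a log-domain factor-to-variable update for binary RVs, assuming messages are stored as log-probabilities (function names are illustrative):

```python
# Sum-product in the log domain: multiplication becomes addition and
# marginalization becomes logsumexp, so messages never hit inf or 1e-70.
import numpy as np
from scipy.special import logsumexp

def log_factor_to_var(log_factor, incoming_log_msgs, target_axis):
    total = np.array(log_factor, dtype=float)
    other_axes = [a for a in range(total.ndim) if a != target_axis]
    for axis, msg in zip(other_axes, incoming_log_msgs):
        shape = [1] * total.ndim
        shape[axis] = msg.size
        total = total + msg.reshape(shape)          # product -> sum of logs
    out = logsumexp(total, axis=tuple(other_axes))  # marginalize the other RVs
    return out - logsumexp(out)                     # normalize in the log domain
```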

Visualize node weights during LBP

See how the LLR for each node changes as the LBP algorithm runs. Maybe it will give some insight into convergence.

  • Need graph nodes to be structured consistently (e.g. hash -> internals -> input)
  • Create a directory under ./experiments for each run and log the GraphML files there, along with the output logs (see the sketch after this list)
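A minimal sketch of the per-run logging, assuming networkx holds the graph and LLRs come as a dict keyed by node (directory layout and file names are illustrative):

```python
# Write one GraphML snapshot per LBP iteration so the LLR trajectory of each
# node can be inspected in a graph viewer afterwards.
import os, time
import networkx as nx

def log_lbp_iteration(graph, llrs, run_dir, iteration):
    os.makedirs(run_dir, exist_ok=True)
    nx.set_node_attributes(graph, llrs, name="llr")
    nx.write_graphml(graph, os.path.join(run_dir, f"iter_{iteration:03d}.graphml"))

run_dir = os.path.join("experiments", time.strftime("%Y%m%d-%H%M%S"))
```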

Correctly predict 2 rounds

1 round is already working, but not 2.

The graph pruning algorithm is poor right now. Look more into graph sparsification/coarsening.

AND gate

The AND gate does not seem to work for either GTSAM or LBP.

Additionally for LBP, the initialization of factor/RV messages for observed RVs needs to be double-checked.
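For reference, a minimal sketch of the deterministic AND factor as a conditional probability table, plus one common convention for observed-RV messages (an assumption to check against the code, not necessarily the project's actual initialization):

```python
import numpy as np

# P(out | a, b) for the AND gate, indexed [a, b, out]: a valid factor must
# put all of its mass on out = a AND b.
and_cpt = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        and_cpt[a, b, a & b] = 1.0

# A common convention: an observed RV sends a delta message on its value.
def observed_message(value):
    msg = np.zeros(2)
    msg[value] = 1.0
    return msg
```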

Try these ideas

  • Use sampling (GTSAM)
  • Use tree-reweighted max product
  • With LBP, maybe it will converge to the correct marginals if you "observe" N internal bits of SHA-256. Of course, these aren't really observed, but if you brute-force all 2^N possibilities and N is small and it converges, it could work (see the sketch after this list) [LBP probably won't work]
  • Use variational inference (this is not being done by GTSAM DiscreteBayesNet). Can use PyMC3.
  • Optimization techniques for high-dimensional variables: read here and here
  • Particle swarm optimization [too slow, didn't work]
  • "Observe" as much as you can from hash outputs, e.g. if the parent is SAME, INV, or AND and the result is 1 [done]
  • Reformulate AND gate as a set of switchable constraints in which only a subset should be active, and then apply robust estimation techniques like DCS and GNC
  • Try to use continuous random variables for 32-bit BitVectors and formulate continuous-domain equivalents for AND, INV, XOR, SHIFT, OR, ADD
  • Represent hash function as a neural network (maintaining AND and NOT operations), but incorporate skip connections such that nodes can be made aware of decisions upstream. Look into references at the bottom of this article.
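A hypothetical sketch of the brute-force idea above; `run_lbp` and `check_hash` are assumed helpers, not part of the codebase:

```python
# Clamp N chosen internal bits to each of the 2^N assignments, run LBP, and
# keep any assignment whose converged marginals reproduce the target hash.
import itertools

def brute_force_observed_bits(graph, candidate_bits, n):
    for assignment in itertools.product((0, 1), repeat=n):
        observed = dict(zip(candidate_bits[:n], assignment))
        marginals, converged = run_lbp(graph, observed=observed)
        if converged and check_hash(marginals):
            return observed, marginals
    return None, None  # no assignment converged to a consistent solution
```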

What happens at round 17?

Love your work here. Exactly the kind of stuff I find fascinating to obsess over, even though it's "impossible" to solve.

You wrote,

Something happens in the 17th round that causes a major spike in complexity. My best guess is that an AND gate between bits "very far apart" in the computation is the issue. If you look at the maximum distance between the indices of random variable bits which are inputs to an AND gate, the maximum gap is around 126,000 bits apart up to 16 rounds. At 17 rounds, the gap increases to 197,000 bits. For a SAT solver, this could mean that it doesn't detect an invalid variable assignment until values have been propagated a long way. For the full 64-round SHA-256, the maximum gap is 386,767.

This sounds plausible to me, and I'd just like to, as someone who knows very little about this subject but is curious enough to think about it a little, propose a mechanism for this.

In the first 16 rounds the SHA-256 algorithm is still just ingesting straight words from the message. At each round it takes the next word from the unmodified input message.

If you were to look at the equation for each internal bit variable, it would be almost entirely linear. The only source of non-linearity comes from CH and MAJ in the compression function, but since they only affect 2 words per round, it's as if that non-linearity as a function of the input spreads only very "slowly" through the internal state. Like you wrote, the "distance" between significant AND gates is still small.

(Side note: in your case, since you convert XOR to AND, I guess there's actually a bunch of additional apparent non-linearity in the equations you feed to the SAT solver, but it's "fake" non-linearity in some sense, which CryptoMiniSat especially might "see through", since it wants to use its hallmark Gauss-Jordan elimination.)

Then look at what happens at round 17: we take the first word from the message schedule that wasn't a straight copy from the input message. It mixes bits from 4 different input words using rotations and 32-bit addition. At the boolean bit-vector level, we now have true non-linearity from the input, since we have to use AND gates to implement the carry. And it's "long distance", since word 17 mixes in input all the way back from the 1st word.
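For reference, the standard SHA-256 message schedule (FIPS 180-4) that kicks in at word 16 (round 17, 1-indexed):

```python
# From word 16 onward, each schedule word mixes four earlier words using
# rotations and 32-bit addition; the addition's carry chain is where the
# long-distance AND gates enter.
def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

def sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def expand_schedule(w):  # w: the 16 32-bit input message words
    for t in range(16, 64):
        w.append((sigma1(w[t - 2]) + w[t - 7]
                  + sigma0(w[t - 15]) + w[t - 16]) & 0xFFFFFFFF)
    return w
```

Note that W[16] already reaches back to W[0], matching the "long distance" observation above.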

So that's my take on the complexity explosion. (I could be completely wrong; something about hash functions is just so inherently unintuitive to me that I keep catching myself believing they can't actually work at all. It's like... they have to be reversible, right? I just haven't seen the trick.)

Efficient BN construction

Given the fully connected graph as an adjacency matrix, how to efficiently prune the graph to get a good Bayesian network structure?

There should be a maximum number of neighbors for each node.

Linear algebra will probably give the best solution. Look into eigenvectors/eigenvalues of the adjacency matrix and properties of the Laplacian matrix.
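A minimal sketch of one simple baseline under these constraints (a starting point, not the proposed algorithm): keep the strongest edges per node, then use the Laplacian spectrum to verify the pruned graph stays connected:

```python
import numpy as np

def prune_adjacency(adj, max_neighbors):
    # Keep at most `max_neighbors` strongest neighbors per node.
    pruned = np.zeros_like(adj, dtype=float)
    for i in range(adj.shape[0]):
        keep = np.argsort(adj[i])[::-1][:max_neighbors]
        pruned[i, keep] = adj[i, keep]
    pruned = np.maximum(pruned, pruned.T)  # keep the graph undirected

    # Algebraic connectivity: the Laplacian's second-smallest eigenvalue is
    # positive iff the pruned graph is still connected.
    laplacian = np.diag(pruned.sum(axis=1)) - pruned
    fiedler_value = np.linalg.eigvalsh(laplacian)[1]
    return pruned, fiedler_value
```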

Add a license

This project and its README are impressive! Would you consider adding either the MIT, Apache 2.0, or GNU GPLv3 license? This would give users permission to run the code locally, make modifications on their local machines, and, if something interesting is discovered, contribute a pull request upstream. Thanks for considering this issue!

Add config file for each dataset

Add a YAML config file for each dataset so that information does not have to be derived from the filename.

This should also enable me to remove HASH_INPUT_NBITS from main.py
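A hypothetical sketch of such a config, loaded with PyYAML; the field names (hash_algo, hash_input_nbits, difficulty) are illustrative guesses:

```python
import yaml

example = """
hash_algo: sha256
hash_input_nbits: 512
difficulty: 15
"""

config = yaml.safe_load(example)
HASH_INPUT_NBITS = config["hash_input_nbits"]  # would replace the constant in main.py
```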
