
cryptosym's Issues

Add relevant RVs from SHA-256

I think there are some important random variables in the SHA-256 internals that are not being captured, which is holding back LBP from converging. Look at the algorithm more closely.

Handling AND factors whose inputs are the same

Not working:

  • add
  • add const
  • lossy pseudo hash
  • sha256

Working:

  • non-lossy pseudo hash
  • xor const
  • shift
  • invert
  • and const

What do the ones that don't work have in common? The AND gate inputs are the same random variable.

Fix:
It's a hack, but if the same RV feeds both inputs of an AND gate, create a child RV from the original RV and feed the original and the child into the AND gate. However, this introduces a small loop in the undirected graph, which may cause issues with LBP convergence.
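A minimal sketch of this workaround, assuming RVs are nodes in a factor-graph object; `make_equal_child` and `add_and_factor` are illustrative names, not the project's actual API:

```python
# Hypothetical sketch: duplicate an RV before wiring an AND gate whose two
# inputs would otherwise be the same node. The child is tied to its parent by
# an equality factor, which introduces the small undirected loop noted above.
def add_and(graph, a, b, out):
    if a is b:
        b = graph.make_equal_child(a)  # child RV constrained equal to `a`
    graph.add_and_factor(a, b, out)    # the AND factor now sees two distinct nodes
```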

Good results with only 2 iterations?

With the final 32 of the 512 input bits unknown, LBP correctly solves SHA-256 at difficulty level 15 with only 2 LBP iterations.

But the Bayesian network is NOT acyclic (not a polytree), since we need to consider cycles in the undirected version of the BN. If it were acyclic, 2 iterations would be enough and everything would converge to the true marginals.

If bits close to the hash are predicted very accurately, you could potentially "freeze" them as observed variables, then re-run BP, and repeat. But no: if bits near the hash could take either 0 or 1 under their constraints (children bits), you would need to track both hypotheses upstream.

GTSAM

Could GTSAM with DCS possibly be used to identify the graph topology and perform inference?

Log-domain LBP

Try to derive LBP in the log domain on your own, because messages often blow up to inf or underflow to values like 1e-70.
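A minimal sketch of a log-domain factor-to-variable update for binary RVs, assuming messages are stored as log-probabilities (function names are illustrative):

```python
# Sum-product in the log domain: multiplication becomes addition and
# marginalization becomes logsumexp, so messages never hit inf or 1e-70.
import numpy as np
from scipy.special import logsumexp

def log_factor_to_var(log_factor, incoming_log_msgs, target_axis):
    total = np.array(log_factor, dtype=float)
    other_axes = [a for a in range(total.ndim) if a != target_axis]
    for axis, msg in zip(other_axes, incoming_log_msgs):
        shape = [1] * total.ndim
        shape[axis] = msg.size
        total = total + msg.reshape(shape)          # product -> sum of logs
    out = logsumexp(total, axis=tuple(other_axes))  # marginalize the other RVs
    return out - logsumexp(out)                     # normalize in the log domain
```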

Visualize node weights during LBP

See how the LLR for each node changes as the LBP algorithm runs. Maybe it will give some insight into convergence.

  • Need graph nodes to be structured consistently (e.g. hash -> internals -> input)
  • Create a directory under ./experiments for each run and log the GraphML files there, along with the output logs (see the sketch after this list)
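A minimal sketch of the per-run logging, assuming networkx holds the graph and LLRs come as a dict keyed by node (directory layout and file names are illustrative):

```python
# Write one GraphML snapshot per LBP iteration so the LLR trajectory of each
# node can be inspected in a graph viewer afterwards.
import os, time
import networkx as nx

def log_lbp_iteration(graph, llrs, run_dir, iteration):
    os.makedirs(run_dir, exist_ok=True)
    nx.set_node_attributes(graph, llrs, name="llr")
    nx.write_graphml(graph, os.path.join(run_dir, f"iter_{iteration:03d}.graphml"))

run_dir = os.path.join("experiments", time.strftime("%Y%m%d-%H%M%S"))
```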

Correctly predict 2 rounds

1 round is already working, but not 2.

The graph pruning algorithm is poor right now. Look more into graph sparsification/coarsening.

AND gate

The AND gate does not seem to work for either GTSAM or LBP.

Additionally for LBP, the initialization of factor/RV messages for observed RVs needs to be double-checked.
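For reference, a minimal sketch of the deterministic AND factor as a conditional probability table, plus one common convention for observed-RV messages (an assumption to check against the code, not necessarily the project's actual initialization):

```python
import numpy as np

# P(out | a, b) for the AND gate, indexed [a, b, out]: a valid factor must
# put all of its mass on out = a AND b.
and_cpt = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        and_cpt[a, b, a & b] = 1.0

# A common convention: an observed RV sends a delta message on its value.
def observed_message(value):
    msg = np.zeros(2)
    msg[value] = 1.0
    return msg
```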

Try these ideas

  • Use sampling (GTSAM)
  • Use tree-reweighted max product
  • With LBP, maybe it will converge to the correct marginals if you "observe" N internal bits of SHA-256. Of course, these aren't really observed, but if you brute-force all 2^N possibilities and N is small and it converges, it could work (see the sketch after this list) [LBP probably won't work]
  • Use variational inference (this is not being done by GTSAM DiscreteBayesNet). Can use PyMC3.
  • Optimization techniques for high-dimensional variables: read here and here
  • Particle swarm optimization [too slow, didn't work]
  • "Observe" as much as you can from hash outputs, e.g. if the parent is SAME, INV, or AND and the result is 1 [done]
  • Reformulate AND gate as a set of switchable constraints in which only a subset should be active, and then apply robust estimation techniques like DCS and GNC
  • Try to use continuous random variables for 32-bit BitVectors and formulate continuous-domain equivalents for AND, INV, XOR, SHIFT, OR, ADD
  • Represent hash function as a neural network (maintaining AND and NOT operations), but incorporate skip connections such that nodes can be made aware of decisions upstream. Look into references at the bottom of this article.
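A hypothetical sketch of the brute-force idea above; `run_lbp` and `check_hash` are assumed helpers, not part of the codebase:

```python
# Clamp N chosen internal bits to each of the 2^N assignments, run LBP, and
# keep any assignment whose converged marginals reproduce the target hash.
import itertools

def brute_force_observed_bits(graph, candidate_bits, n):
    for assignment in itertools.product((0, 1), repeat=n):
        observed = dict(zip(candidate_bits[:n], assignment))
        marginals, converged = run_lbp(graph, observed=observed)
        if converged and check_hash(marginals):
            return observed, marginals
    return None, None  # no assignment converged to a consistent solution
```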

What happens at round 17?

Love your work here. Exactly the kind of stuff I find fascinating to obsess over, even though it's "impossible" to solve.

You wrote,

Something happens in the 17th round that causes a major spike in complexity. My best guess is that an AND gate between bits "very far apart" in the computation is the issue. If you look at the maximum distance between the indices of random variable bits which are inputs to an AND gate, the maximum gap is around 126,000 bits apart up to 16 rounds. At 17 rounds, the gap increases to 197,000 bits. For a SAT solver, this could mean that it doesn't detect an invalid variable assignment until values have been propagated a long way. For the full 64-round SHA-256, the maximum gap is 386,767.

This sounds plausible to me, and I'd just like to, as someone who knows very little about this subject but is curious enough to think about it a little, propose a mechanism for this.

In the first 16 rounds the SHA-256 algorithm is still just ingesting straight words from the message. At each round it takes the next word from the unmodified input message.

If you were to look at the equation for each internal bit variable, it would be almost entirely linear. The only source of non-linearity comes from CH and MAJ in the compression function, but since they only affect 2 words per round, it's as if that non-linearity as a function of the input spreads only very "slowly" through the internal state. Like you wrote, the "distance" between significant AND gates is still small.

(Side note: in your case, since you convert XOR to AND, I guess there's actually a bunch of additional apparent non-linearity in the equations you feed to the SAT solver, but it's "fake" non-linearity in some sense, which CryptoMiniSat especially might "see through", since it wants to use its hallmark Gauss-Jordan elimination.)

Then look at what happens at round 17: we take the first word from the message schedule that wasn't a straight copy from the input message. It mixes bits from 4 different input words using rotations and 32-bit addition. At the boolean bit-vector level, we now have true non-linearity from the input, since we have to use AND gates to implement the carry. And it's "long distance", since word 17 mixes in input all the way back from the 1st word.
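For reference, the standard SHA-256 message schedule (FIPS 180-4) that kicks in at word 16 (round 17, 1-indexed):

```python
# From word 16 onward, each schedule word mixes four earlier words using
# rotations and 32-bit addition; the addition's carry chain is where the
# long-distance AND gates enter.
def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)

def sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

def expand_schedule(w):  # w: the 16 32-bit input message words
    for t in range(16, 64):
        w.append((sigma1(w[t - 2]) + w[t - 7]
                  + sigma0(w[t - 15]) + w[t - 16]) & 0xFFFFFFFF)
    return w
```

Note that W[16] already reaches back to W[0], matching the "long distance" observation above.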

So that's my take on the complexity explosion. (I could be completely wrong; something about hash functions is just so inherently unintuitive to me that I keep catching myself believing they can't actually work at all. It's like... they have to be reversible, right? I just haven't seen the trick.)

Efficient BN construction

Given the fully connected graph as an adjacency matrix, how to efficiently prune the graph to get a good Bayesian network structure?

There should be a maximum number of neighbors for each node.

Linear algebra will probably give the best solution. Look into eigenvectors/eigenvalues of the adjacency matrix and properties of the Laplacian matrix.
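A minimal sketch of one simple baseline under these constraints (a starting point, not the proposed algorithm): keep the strongest edges per node, then use the Laplacian spectrum to verify the pruned graph stays connected:

```python
import numpy as np

def prune_adjacency(adj, max_neighbors):
    # Keep at most `max_neighbors` strongest neighbors per node.
    pruned = np.zeros_like(adj, dtype=float)
    for i in range(adj.shape[0]):
        keep = np.argsort(adj[i])[::-1][:max_neighbors]
        pruned[i, keep] = adj[i, keep]
    pruned = np.maximum(pruned, pruned.T)  # keep the graph undirected

    # Algebraic connectivity: the Laplacian's second-smallest eigenvalue is
    # positive iff the pruned graph is still connected.
    laplacian = np.diag(pruned.sum(axis=1)) - pruned
    fiedler_value = np.linalg.eigvalsh(laplacian)[1]
    return pruned, fiedler_value
```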

Add a license

This project and its README are impressive! Would you consider adding either the MIT, Apache 2.0, or GNU GPLv3 license? This would give users permission to run the code locally, make modifications on their local machines, and, if something interesting is discovered, contribute a pull request upstream. Thanks for considering this issue!

Add config file for each dataset

Add a YAML config file for each dataset so that information does not have to be derived from the filename.

This should also enable me to remove HASH_INPUT_NBITS from main.py
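A hypothetical sketch of such a config, loaded with PyYAML; the field names (hash_algo, hash_input_nbits, difficulty) are illustrative guesses:

```python
import yaml

example = """
hash_algo: sha256
hash_input_nbits: 512
difficulty: 15
"""

config = yaml.safe_load(example)
HASH_INPUT_NBITS = config["hash_input_nbits"]  # would replace the constant in main.py
```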
