
cryptosym

Figure: bit relationships after 4 rounds of SHA-256 (17,806 nodes, 26,383 edges).

Figure: lossy "pseudo-hash" used to validate solver accuracy, showing the relationship between input and output bits.


Description

This repository contains Python and C++ code which attempts to reverse one-way cryptographic hash functions, with specific focus on SHA-256. A hash function f can be thought of as an operation on bits X to produce output bits Y: f(X) = Y. Given knowledge of Y and how f works, we want to find some bits X' such that f(X') = Y. This is commonly known as a preimage attack. Note that X does not necessarily need to equal X'.

A successful preimage attack has serious implications for basically the entire Internet, the financial community, and the national defense of major governments. Hash functions are used in all kinds of domains: from Bitcoin mining and transactions, to HTTPS encryption, to the storage of user passwords in server databases.

I've spent a long time (too long!) trying to solve this "impossible" problem using a variety of methods, detailed below. As a disclaimer: I do not claim that any of these methods break the security of the full 64-round SHA-256 hash function. It's probably still infeasible. Prove me wrong :)

Installation

It is not necessary to install everything listed here. If you just want to run certain parts of the code (e.g. C++ vs. Python) you can skip some steps. I developed everything on a Mac, so most things will translate well to Linux but maybe not to Windows.

Python

Use Python 3.6 with Anaconda. Start by creating and activating a new environment:

conda create -n preimage python=3.6
conda activate preimage
python --version  # Should show 3.6

Install project dependencies:

python -m pip install -r requirements.txt
python -m pip install -i https://pypi.gurobi.com gurobipy

Not all of the solving methods will work out-of-the-box from here. If you want certain solvers, you will need to install them individually:

For Cplex on Mac OS X, make sure to do:

export PYTHONPATH=/Applications/CPLEX_Studio1210/cplex/python/3.6/x86-64_osx

C++

You need to install cmake and yaml-cpp. Then, to compile the code in this repository, navigate to the belief_propagation directory and run:

$ ./kompile

Quickstart

Generate datasets:

$ ./generate_all_hashes.sh

Choose a dataset:

$ ls -la ./data
drwxr-xr-x  27 trevphil  staff  864 Nov  3 23:04 .
drwxr-xr-x  15 trevphil  staff  480 Nov 22 12:29 ..
drwxr-xr-x  11 trevphil  staff  352 Nov  3 22:54 addConst_d1
drwxr-xr-x  11 trevphil  staff  352 Nov  3 22:53 add_d1
drwxr-xr-x  11 trevphil  staff  352 Nov  3 22:54 andConst_d1
drwxr-xr-x  10 trevphil  staff  320 Nov  3 22:54 sha256_d1
...

The "d1" at the end means that the difficulty level is "1". The full SHA-256 algorithm uses a difficulty of 64 (because it has 64 rounds). Run a solver on the dataset of your choice:

$ python -m optimization.main data/addConst_d1 --solver ortools_cp

To see the available solvers and usage of the tool:

$ python -m optimization.main --help
usage: main.py [-h]
               [--solver {gradient,gnc,cplex_milp,cplex_cp,ortools_cp,ortools_milp,gurobi_milp,minisat,crypto_minisat}]
               dataset

Hash reversal via optimization

positional arguments:
  dataset               Path to the dataset directory

optional arguments:
  -h, --help            Show this help message and exit
  --solver {gradient,gnc,cplex_milp,cplex_cp,ortools_cp,ortools_milp,gurobi_milp,minisat,crypto_minisat}
                        The solving technique

To run the C++ belief propagation code after having compiled it with ./kompile, choose one of the config files and run the following command from the belief_propagation directory:

$ ./main config/config_file_name.yaml

Writing your own hash function

Take a look at the existing hash functions in dataset_generation/hash_funcs.py. For example, this function simply adds a constant value to the input:

class AddConst(SymbolicHash):
    def hash(self, hash_input: SymBitVec, difficulty: int):
        n = len(hash_input)
        A = SymBitVec(0x4F65D4D99B70EF1B, size=n)
        return hash_input + A

Implement your own hash function class which inherits from SymbolicHash and implements hash(...), then add your class to the dictionary returned by the function hash_algorithms() in dataset_generation/hash_funcs.py. The following operations are supported for primitives (SymBitVec objects) in the hash function (a sketch of a custom hash follows the list):

  • AND: C = A & B
  • OR: C = A | B
  • XOR: C = A ^ B
  • NOT (aka INV for "inverse"): B = ~A
  • Addition: C = A + B
    • A, B, and C should have the same number of bits; overflow is ignored
  • Shift left: B = (A << n)
    • B will have the same number of bits as A
  • Shift right: B = (A >> n)
    • B will have the same number of bits as A
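
For instance, here is a minimal sketch of a custom hash built from these operations. The class name XorShift, its round structure, and its constant are illustrative assumptions, not code from the repository:

class XorShift(SymbolicHash):
    def hash(self, hash_input: SymBitVec, difficulty: int):
        n = len(hash_input)
        out = hash_input
        for _ in range(difficulty):
            out = out ^ (out << 13)  # diffuse low bits upward
            out = out ^ (out >> 7)   # diffuse high bits back down
            out = out + SymBitVec(0x9E3779B97F4A7C15, size=n)  # add a constant
        return out

Registering it means adding an entry for XorShift to the dictionary returned by hash_algorithms().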

To generate a dataset using your hash function, run a command like the following:

$ python -m dataset_generation.generate --num-samples 64 --num-input-bits 64 --hash-algo my_custom_hash --visualize --difficulty 4

The dataset is explained more in-depth below. To see the usage of the dataset generation tool, run:

python -m dataset_generation.generate --help
usage: generate.py [-h] [--data-dir DATA_DIR] [--num-samples NUM_SAMPLES]
                   [--num-input-bits NUM_INPUT_BITS]
                   [--hash-algo {add,addConst,andConst,invert,lossyPseudoHash,nonLossyPseudoHash,orConst,sha256,shiftLeft,shiftRight,xorConst}]
                   [--difficulty DIFFICULTY] [--visualize]
                   [--hash-input HASH_INPUT] [--pct-val PCT_VAL]
                   [--pct-test PCT_TEST]

Hash reversal dataset generator

optional arguments:
  -h, --help            Show this help message and exit
  --data-dir DATA_DIR   Path to the directory where the dataset should be
                        stored
  --num-samples NUM_SAMPLES
                        Number of samples to use in the dataset
  --num-input-bits NUM_INPUT_BITS
                        Number of bits in each input message to the hash
                        function
  --hash-algo {add,addConst,andConst,invert,lossyPseudoHash,nonLossyPseudoHash,orConst,sha256,shiftLeft,shiftRight,xorConst}
                        Choose the hashing algorithm to apply to the input
                        data
  --difficulty DIFFICULTY
                        SHA-256 difficulty (an integer between 1 and 64
                        inclusive)
  --visualize           Visualize the symbolic graph of bit dependencies
  --hash-input HASH_INPUT
                        Give input message in hex to simply print the hash of
                        the input
  --pct-val PCT_VAL     Percent of samples used for validation dataset
  --pct-test PCT_TEST   Percent of samples used for test dataset

How it works

Dataset generation

The first thing we need in order to approach this problem is to find out how the hash function f operates on input bits X to produce output bits Y without making any assumptions about X. To this end, I make the hash functions operate on symbolic bit vectors, i.e. SymBitVec objects. These objects track the relationship between input and output bits for all computations in the hash function, e.g. if C = A & B, we track the relationship that bit C is the result of AND-ing A and B. Ultimately after all of the computations, we will have the relationship between each bit of Y and each bit of X in the function f(X) = Y.

Some simplifications can be made during this process. For example, suppose bit A comes from the unknown input X and bit B is a constant in the hash algorithm equal to 0. Then C = A & 0 = 0, no matter the value of A. More simplifications for operations on single bits are listed below:

  • B = A & 1 = A
  • B = A | 0 = A
  • B = A | 1 = 1
  • B = A ^ 1 = ~A
  • B = A ^ 0 = A
  • B = A ^ A = 0
  • B = A & A = A | A = A

These simplifications help to reduce the size of the symbolic representation of the hash function, since the output bit B is sometimes a constant or equal to the unknown input A. When this happens, we don't need to introduce a new unknown variable.
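
As a sketch of how these rules might be applied while tracking bits symbolically (a hypothetical helper operating on plain 0/1 constants and symbolic objects, not the repository's actual implementation):

def simplify_and(a, b):
    # Try single-bit simplifications for C = a & b before creating a new gate
    for x, y in ((a, b), (b, a)):
        if x == 0:
            return 0  # A & 0 = 0: the output is a constant
        if x == 1:
            return y  # A & 1 = A: the output equals the other input
    if a is b:
        return a  # A & A = A: no new variable needed
    return ("AND", a, b)  # otherwise, introduce a new unknown variable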

Furthermore, the problem can be made easier to handle by reducing all operations (XOR, OR, addition) to only using AND and INV logic gates. For example, C = A ^ B is equivalent to:

X = ~(A & B)
Y = ~(A & X)
Z = ~(B & X)
C = ~(Y & Z)

This does introduce intermediate variables, but critically, the AND and INV operations can be linearized and also represented in the continuous domain, which is important mathematically. Normally in the discrete domain, we would consider each bit as a binary "random variable" taking the value 0 or 1.
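
A quick standalone truth-table check (not repository code) confirms that this four-gate construction computes XOR for every input pair:

for A in (0, 1):
    for B in (0, 1):
        X = 1 - (A & B)  # ~(A & B)
        Y = 1 - (A & X)  # ~(A & X)
        Z = 1 - (B & X)  # ~(B & X)
        C = 1 - (Y & Z)  # ~(Y & Z)
        assert C == A ^ B  # matches XOR on all four input pairs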

The AND operation can be represented with multiplication: C = A & B = A * B, and the INV operation with subtraction: B = ~A = 1 - A. To linearize the AND operation, we can use the following method:

C = A & B = A * B
Equivalent to:
C <= A
C <= B
C >= A + B - 1
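
Enumerating all binary assignments confirms that the three inequalities admit exactly the assignments where C = A & B (a standalone sanity check, not repository code):

for A in (0, 1):
    for B in (0, 1):
        for C in (0, 1):
            feasible = (C <= A) and (C <= B) and (C >= A + B - 1)
            assert feasible == (C == (A & B))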

Note that no information is lost during the INV operation (we can always recover the input from the output of INV), but there is information lost during AND when the output is 0. When the output is 1, we know that both inputs must have been 1. But when the output is 0, there are three possible inputs: 0 = 0 & 0 = 0 & 1 = 1 & 0. Thus, all complexity in reversing a hash function comes from AND gates whose output is 0.

Dataset Format

When you create a dataset using the command python -m dataset_generation.generate [args], it will create a sub-directory under ./data with a name like hashFunc_dX, where "hashFunc" is the name of the hash function and "X" is the integer difficulty level used by the hash function. Within this directory, the following files will be created:

hashFunc_dX
├── data.bits
├── factors.cnf
├── factors.txt
├── graph.graphml
├── graph.pdf
├── params.yaml
├── test.hdf5
├── train.hdf5
└── val.hdf5

Below is an outline of what each file contains:

  • params.yaml: Information about the dataset like the name of the hash algorithm, number of input bits X, and indices of the hash output bits Y with respect to all of the random variable bits tracked in the hash computation
  • data.bits: This is a binary file. Suppose you chose the options --num-samples 64 --num-input-bits 128; then the hash function executes 64 times, each time with a random 128-bit input. Suppose one pass of the hash function generates 2000 random variable bits, starting with the hash input bits X and (generally) ending with the hash output bits Y, although the Y bits might not be grouped together consecutively at the end. This file then contains 64*2000 bits: 2000 bits concatenated 64 times. When I say that a bit has index i, I mean the i-th of those 2000 bits. The number of samples should always be a multiple of 8, to avoid errors where the file length does not fit into an integer number of bytes. (A sketch for unpacking this file follows the list.)
  • factors.txt: Encodes the relationship between input and output bits of logic gates. Some examples follow...
    • PRIOR;10: The random variable bit with index 10 is a prior, i.e. a bit from the unknown input X
    • INV;94;93: The random variable bit with index 94 is a result of the INV operation on bit 93, i.e. B94 = ~B93
    • AND;95;61;29: Bit 95 is the result of AND-ing bits 61 and 29, i.e. B95 = B61 & B29
  • factors.cnf: An alternative representation of the relationship between random variable bits using DIMACS Conjunctive Normal Form (CNF). All logic gates can be converted to this form, see factor.py. The bit indices in CNF are all +1 relative to their indices in the other representations, so keep that in mind.
  • graph.pdf: If you specify the optional --visualize argument to the dataset generation tool, it will create a visualization of the hash function like the one shown in the beginning of the README. Hash input bits are shown in black, and output bits in green.
  • graph.graphml: This is another representation of the directed graph in graphml format showing relationships between bits, useful for visualizing in tools like Gephi
  • test.hdf5, train.hdf5, val.hdf5: These contain the same data as data.bits but in HDF5 format which is convenient for machine learning. The samples from the --num-samples option are distributed among train, test, and validation sets.
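
For reference, here is a sketch of how one might unpack data.bits with NumPy. The sample count and bits-per-sample are illustrative and should be taken from params.yaml, and the bit order within each byte is assumed to match numpy.unpackbits's most-significant-bit-first default, which is worth verifying against how the repository writes the file:

import numpy as np

num_samples, bits_per_sample = 64, 2000  # illustrative; read from params.yaml
raw = np.fromfile("data/hashFunc_dX/data.bits", dtype=np.uint8)
bits = np.unpackbits(raw)[: num_samples * bits_per_sample]
samples = bits.reshape(num_samples, bits_per_sample)
# samples[i, j] is the bit with index j in the i-th hash computation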

Solving Methods

Using the latest and greatest as of November 2020, the solving methods are listed below, from what I believe to be most effective to least effective. Even the best methods seem to fail at the 17th round of SHA-256 (discussed below).

  1. CryptoMiniSat: I didn't even take advantage of CryptoMiniSat's optimized treatment of XOR since all operations were reduced to AND and INV logic gates
  2. MiniSat
  3. Gurobi MILP
  4. Google ortools MILP
  5. Google ortools constraint programming
  6. Cplex constraint programming
  7. Cplex MILP
  8. Least-squares optimization
  9. Graduated non-convexity
  10. Loopy belief propagation
  11. Graph-based neural network

These strategies can be broken into a few general categories that I will discuss soon.

First, here is a log-log plot of problem size (number of unknown variables) vs. runtime for a selection of solvers, along with a linear regression on the log-log data. The problem sizes correspond to SHA-256 rounds 1, 4, 8, 12, and 16. These solvers were run on my weak dual-core 16 GB MacBook Air, and none was allowed to run for more than 24 hours.

Figure: solve times vs. problem size (log-log), with linear regressions.

I think it is interesting to see how the SHA-256 algorithm's complexity increases with the number of rounds. Something happens in the 17th round that causes a major spike in complexity. My best guess is that an AND gate between bits "very far apart" in the computation is the issue. If you look at the maximum distance between the indices of random variable bits which are inputs to an AND gate, the maximum gap is around 126,000 bits apart up to 16 rounds. At 17 rounds, the gap increases to 197,000 bits. For a SAT solver, this could mean that it doesn't detect an invalid variable assignment until values have been propagated a long way. For the full 64-round SHA-256, the maximum gap is 386,767.

Below, we can see that the numbers of INV and AND gates grow roughly linearly, and at the same rate, as the number of rounds increases. The prior factors correspond to hash input bits, so naturally they stay constant.

Figure: growth in the number of AND, INV, and prior factors vs. number of rounds.

SAT Solvers

Satisfiability (SAT) solvers operate on boolean logic relationships to find a satisfying assignment for free boolean variables. I would put solvers like MiniSat, CryptoMiniSat, ortools constraint programming, and Cplex constraint programming in this category. These types of solvers generally work by assigning fixed values to free variables until there is a logical conflict, and then backtracking until the conflict is resolved and trying new variable assignments in the conflict region. This is a gross simplification though. There are a lot of tricks people have come up with to speed up the process: heuristics to pick the order of variable assignments, learning from conflicts, etc. There are good introductory websites and recorded talks online if you want to learn more.

From what I can tell, this is one of the best methods for preimage attacks to date. A lot of the solvers are implemented in C or C++ so they run extremely fast, and they are heavily optimized with heuristics. People have already tried to use SAT solvers to break cryptographic hash functions, but you do eventually hit a limit on the feasible problem size.
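
As a small illustration of how the AND/INV representation maps onto SAT, here is a sketch using CryptoMiniSat's Python bindings (pycryptosat). The gate C = A & B becomes three clauses in its standard Tseitin form, and observing the output bit adds one unit clause; variable indices are 1-based, mirroring the +1 DIMACS convention used in factors.cnf:

from pycryptosat import Solver

solver = Solver()
A, B, C = 1, 2, 3               # 1-based DIMACS variable indices
solver.add_clause([-C, A])      # C implies A
solver.add_clause([-C, B])      # C implies B
solver.add_clause([-A, -B, C])  # (A and B) implies C
solver.add_clause([C])          # observe the output bit: C = 1
sat, assignment = solver.solve()
print(sat, assignment[A], assignment[B])  # True True True: both inputs forced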

MILP

Mixed-integer linear programming (MILP) is a form of optimization where the optimization variables may be integer-valued, rather than the usual real-valued. Gurobi MILP, Cplex MILP, and ortools MILP fall into this category. The theory underpinning MILP solvers is honestly pretty complex (or should I say... simplex). However, if you want to read more, Google terms like "linear programming relaxations," "branch and bound," or "cutting planes." A good place to start is Gurobi's introduction to MIP.

Solvers from the major players like Gurobi, Cplex, and COIN-OR work well (much like the SAT solvers) until 17 rounds of SHA-256.

Optimization

As opposed to mixed-integer optimization, I also tried out plain-ol' regular optimization with real-valued optimization variables. A vector x with N elements is optimized, which represents the value of each bit in the hash computation. To convert to boolean values, I set bit i to 1 if x[i] > 0.5, otherwise 0. Bit inversions are represented as equality constraints in the optimization, Ax = b, and AND gates are represented by 3 inequalities (described earlier), giving inequality constraints Cx <= d. Upper and lower bounds on x can also be added to the inequality constraints, enforcing 0 <= x <= 1.

For an example of a bit inversion constraint, let's say bit_1 = ~bit_0 = 1 - bit_0. Then we would have a row in A like [1 1 0 ... 0] and the corresponding element in the b vector would be 1. Then Ax = b gives bit_0 + bit_1 = 1.

The initial guess for x is 0.5 everywhere, except for the observed hash output bits, which are initialized to their observed values. I didn't add the observed bits to the equality constraints because SLSQP complains that the system is overdetermined.
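
Here is a toy sketch of this relaxation for one INV gate and one AND gate, solved with SciPy's SLSQP. Unlike the repository's setup, this toy does pin the observed output bit with an equality constraint for simplicity, and the objective (pushing bits toward 0 or 1) is illustrative:

import numpy as np
from scipy.optimize import minimize

# x = [b0, b1, b2, b3] with b1 = ~b0, b3 = b0 & b2, and output b3 observed as 1
constraints = [
    {"type": "eq",   "fun": lambda x: x[0] + x[1] - 1.0},       # INV: b0 + b1 = 1
    {"type": "eq",   "fun": lambda x: x[3] - 1.0},              # observed output bit
    {"type": "ineq", "fun": lambda x: x[0] - x[3]},             # AND: b3 <= b0
    {"type": "ineq", "fun": lambda x: x[2] - x[3]},             # AND: b3 <= b2
    {"type": "ineq", "fun": lambda x: x[3] - x[0] - x[2] + 1},  # AND: b3 >= b0 + b2 - 1
]
x0 = np.array([0.5, 0.5, 0.5, 1.0])  # unknowns start at 0.5, observed bit at 1
res = minimize(lambda x: np.sum(x * (1.0 - x)), x0,  # push bits toward {0, 1}
               bounds=[(0.0, 1.0)] * 4, constraints=constraints, method="SLSQP")
print((res.x > 0.5).astype(int))  # expect [1 0 1 1]: the AND inputs are forced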

This method turns out to perform poorly if the problem gets even moderately complex. Approximating boolean values by rounding real numbers doesn't translate well into a valid solution.

Robust Estimation and Graduated Non-Convexity

I've done a fair bit of work on graph-based SLAM and had an idea to apply something from robust estimation techniques to solving the preimage problem. This something is called "Graduated Non-Convexity" (GNC), introduced in this paper.

In GraphSLAM, a lot of research has gone into rejecting outliers that are incorrectly added to the graph. When the graph is optimized, certain techniques like dynamic covariance scaling (DCS) can help the optimization ignore outliers that would otherwise influence the objective function. GNC achieves a similar outcome but works by turning a non-convex optimization problem into a convex one, and then gradually decreasing the convexity with a hyperparameter which changes after each iteration. My idea was to apply the same thing to the preimage problem. For each AND gate, we can add "conflicting" constraints which make different assumptions about the AND gate variables (think: 0 & 0, 0 & 1, 1 & 0, 1 & 1), some of which will be incorrect. As the optimization proceeds, the robust estimation technique will reject the bad constraints and keep the good ones.

Unfortunately, much like before, this turned out to not work very well for even moderate problems. However, if this technique could be extended for MIP optimization, maybe we would see quite different results (TODO).

Belief Propagation

Belief propagation (BP) is an iterative "message passing" algorithm in which messages are passed between factors and random variables (RVs) in a factor graph. We say the algorithm has converged when none of the messages change by more than some small value ε.

Once the messages have converged, we can use them to perform queries on the factor graph, for example to answer the question "What is the probability that the input message bit 0 is a 1, given that I know all of the hash bits?"

What exactly is a "message"? It's hard to explain, to be honest. It's quite mathematical, theoretical, and unintuitive. I would recommend this article to understand it.

A factor graph is a bipartite graph wherein one side of the graph has RVs (bits) as nodes, and the other side has all of the "factors" as nodes (AND, INV logic gates). Each factor represents a conditional probability distribution and has a "query" RV (output of the logic gate) as well as a list of "dependency" RVs (inputs to the logic gate).

For example, let C = A & B. Then a factor f may represent P(C | A, B), i.e. the probability of observing C = 0 or C = 1, given that we know the values of A and B.

Each factor has a table which contains the conditional probability distribution (CPD) of the query bit, given all possible values of the dependencies. The table is used during the message-passing algorithm. If there are N dependencies in a factor, the table size is 2^N, so you can see how it is beneficial to keep the size of each factor low. Our factor f would have a table such as:

A B P(C = 1 | A, B)
0 0 P(1 | 0, 0) = 0
0 1 P(1 | 0, 1) = 0
1 0 P(1 | 1, 0) = 0
1 1 P(1 | 1, 1) = 1

Note: It's not necessary to compute P(C = 0 | A, B) because it can be derived by 1.0 - P(C = 1 | A, B).
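
To make the messages slightly more concrete, here is a sketch (in Python, not the repository's C++) of the sum-product message from the AND factor above to its query RV C, given incoming messages from the dependency RVs A and B:

import numpy as np

cpd = np.zeros((2, 2, 2))  # cpd[a, b, c] = P(C = c | A = a, B = b)
for a in (0, 1):
    for b in (0, 1):
        cpd[a, b, a & b] = 1.0  # deterministic AND gate

msg_A = np.array([0.5, 0.5])  # incoming belief over A: uniform
msg_B = np.array([0.9, 0.1])  # incoming belief over B: probably 0
# Factor-to-variable message: marginalize the CPD against incoming messages
msg_to_C = np.einsum("abc,a,b->c", cpd, msg_A, msg_B)
print(msg_to_C)  # [0.95 0.05]: C is probably 0 because B probably is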

I have implemented the belief propagation algorithm in C++ since it can be quite slow in Python with a large problem size. However I found that this method performs poorly in practice. Belief propagation on a tree structure will always converge, but there are no convergence guarantees for a cyclic factor graph (so-called "loopy belief propagation").

What often ends up happening is divergence of the message values because of the cyclic message passing, and we run into numerical overflow/underflow errors. A possible solution could be a logarithmic version of the sum-product algorithm, which I tried to implement but gave up on (see here, TODO: try this one).

Machine Learning

The idea here is that one could train a neural network to predict a valid hash input X given knowledge of hash output Y and the hash function f where f(X) = Y. In other words, a neural network should learn an inverse function g where f(g(Y)) = Y by observing many instances of random inputs and outputs. To this end, I (painfully) modified the SymBitVec primitive to support PyTorch tensors and work 100% with backpropagation. I also modified the dataset generation tool to split samples into train, validation, and test files in HDF5 format.
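
As a sketch of what differentiable bit operations can look like (illustrative "soft" gates, not the repository's actual SymBitVec internals):

import torch

def soft_and(a, b):
    return a * b  # continuous relaxation of AND

def soft_inv(a):
    return 1.0 - a  # continuous relaxation of INV

a = torch.tensor(0.9, requires_grad=True)
b = torch.tensor(0.2, requires_grad=True)
c = soft_inv(soft_and(a, b))  # ~(a & b) in the continuous domain
c.backward()
print(a.grad, b.grad)  # gradients flow through both gate relaxations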

The neural network architecture needs more work and thought put into it. As of now it doesn't perform well at all. I have struggled with enforcing "hard" relationships between bits, for example the network tries to learn valid assignments for all bits in the hash computation, but at the same time needs to be aware that if two bits are related by an INV operation, then B = 1 - A. Graph-based neural networks are one possible approach, but I recently saw this talk and corresponding code for a network called "NeuroSAT", which is extremely interesting and possibly promising.

Backpropagation through the hash function (and, more generally, training a network on a large problem) is unfortunately quite slow. I believe this is a result of the complex Autograd graph that the hash function creates, due to all of the slicing and moving around of individual tensor elements (bits).

References and Resources

You can always try to contact me (trevphil3 -at- gmail -dot- com) and I will do my best to respond. I have spent over a year working on this problem in my free time and I'm quite passionate about it. Further helpful articles and papers are linked throughout the body of the README.


Issues

AND gate

The AND gate does not seem to work for either GTSAM or LBP.

Additionally for LBP, the initialization of factor/RV messages for observed RVs needs to be double-checked.

Try these ideas

  • Use sampling (GTSAM)
  • Use tree-reweighted max product
  • With LBP, maybe it will converge to the correct marginals if you "observed" N internal bits of SHA-256. Of course, these aren't really observed, but if you brute force all 2^N possibilities and N is small and it converges, it could work [LBP probably won't work]
  • Use variational inference (this is not being done by GTSAM DiscreteBayesNet). Can use PyMC3.
  • Optimization techniques for high-dimensional variables: read here and here
  • Particle swarm optimization [too slow, didn't work]
  • "Observe" as much as you can from hash outputs, e.g. if the parent is SAME, INV, or AND and the result is 1 [done]
  • Reformulate AND gate as a set of switchable constraints in which only a subset should be active, and then apply robust estimation techniques like DCS and GNC
  • Try to use continuous random variables for 32-bit BitVectors and formulate continuous-domain equivalents for AND, INV, XOR, SHIFT, OR, ADD
  • Represent hash function as a neural network (maintaining AND and NOT operations), but incorporate skip connections such that nodes can be made aware of decisions upstream. Look into references at the bottom of this article.

Handling AND factors whose inputs are the same

Not working:

  • add
  • add const
  • lossy pseudo hash
  • sha256

Working:

  • non-lossy pseudo hash
  • xor const
  • shift
  • invert
  • and const

What do the ones that don't work have in common? The AND gate inputs are the same random variable.

Fix:
It's a hack, but if the same RV goes into an AND gate as both inputs, create a child RV from the original RV and feed the original and the child into the AND gate. However this introduces a small loop in the undirected graph which may cause issues with LBP convergence.

Efficient BN construction

Given the fully connected graph as an adjacency matrix, how to efficiently prune the graph to get a good Bayesian network structure?

There should be a maximum number of neighbors for each node.

Linear algebra will probably give the best solution. Look into eigenvectors/eigenvalues of the adjacency matrix and properties of the Laplacian matrix.

Good results with only 2 iterations?

With the final 32 out of 512 input bits unknown, LBP can correctly get the SHA-256 at level 15 difficulty with only 2 LBP iterations.

But the Bayesian network is NOT acyclic (not a polytree), since we need to consider cycles in the undirected version of the BN. If it were acyclic, 2 iterations would be enough and everything would converge to the true marginals.

If bits which are close to the hash are predicted very accurately, you could potentially "freeze" them as observed variables, then re-run BP, and keep repeating this. Actually, no: if bits near the hash could take either 0 or 1 based on their constraints (children bits), you would need to track both hypotheses upstream.

Add config file for each dataset

Add config YAML file for each dataset so that information does not have to be derived from the filename

This should also enable me to remove HASH_INPUT_NBITS from main.py

Visualize node weights during LBP

See how LLR for each node changes as the LBP algorithm runs. Maybe it gives an insight about convergence

  • Need graph nodes to be structured consistently (e.g. hash -> internals -> input)
  • Create a directory under ./experiments for each run and log the graphML files to that, as well as output logs

Correctly predict 2 rounds

1 round is already working, but not 2.

The graph pruning algorithm is shit right now. Look more into graph sparsification/coarsening.

What happens at round 17?

Love your work here. Exactly the kind of stuff I find fascinating to obsess about, even though it's "impossible" to solve.

You wrote,

Something happens in the 17th round that causes a major spike in complexity. My best guess is that an AND gate between bits "very far apart" in the computation is the issue. If you look at the maximum distance between the indices of random variable bits which are inputs to an AND gate, the maximum gap is around 126,000 bits apart up to 16 rounds. At 17 rounds, the gap increases to 197,000 bits. For a SAT solver, this could mean that it doesn't detect an invalid variable assignment until values have been propagated a long way. For the full 64-round SHA-256, the maximum gap is 386,767.

This sounds plausible to me, and I'd just like to, as someone who knows very little about this subject but is curious enough to think about it a little, propose a mechanism for this.

In the first 16 rounds, the SHA-256 algorithm is still just ingesting words straight from the message: at each round, it takes the next word from the unmodified input message.

If you were to look at the equation for each internal bit variable, it would be almost entirely linear. The only source of non-linearity comes from CH and MAJ in the compression function, but since they only affect two words per round, it's as if that non-linearity spreads only very "slowly" through the internal state as a function of the input. Like you wrote, the "distance" between significant AND gates stays small.

(Side note: in your case, since you convert XOR to AND, I guess there's actually a bunch of additional apparent non-linearity in the equations you feed to the SAT solver, but it's "fake" non-linearity, in some sense, which CryptoMiniSat especially might just "see through", since it wants to use its hallmark Gauss-Jordan elimination.)

Then look at what happens at round 17: we take the first word from the message schedule that wasn't a straight copy from the input message. It mixes bits from 4 different input words using rotations and 32-bit addition. At the boolean bit-vector level, we now have true non-linearity from the input, since we have to use AND gates to implement the addition carries. And it's "long distance", since word 17 mixes in input all the way back from the 1st word.

So that's my take on the complexity explosion. (I could be completely wrong, something about hashing functions just is so inherently unintuitive to me I keep catching myself believing they can't actually work at all. It's like... they have to be reversible, right? Just haven't seen the trick.)

Add relevant RVs from SHA-256

I think there are some important random variables which are not being captured in the internals, holding back LBP from converging. Look at the algorithm more closely.

Log-domain LBP

Try to derive LBP in the log domain, on your own, since messages often diverge to inf or underflow to values like 1e-70.

Add a license

This project and README are impressive! I wonder if you could consider adding either the MIT, Apache 2.0, or GNU GPLv3 license? This would give users permission to run the code locally, make modifications on their local machines, and, if something interesting is discovered, contribute a pull request upstream. Thanks for considering this issue!

GTSAM

Can possibly use GTSAM with DCS to identify graph topology and perform inference?
