randomx's People

Contributors

alkenso, antanst, bjacquin, cjdelisle, cryptonote-social, hyc, jtgrassie, nioroso-x3, schernykh, selsta, ston1th, tevador, vielmetti, vladimirvr, wepeng, xiphon


randomx's Issues

Proposal: Coming full circle

My primary concern with randomX and its predecessor CNVM is that the machine specification itself is too simple, and can be implemented in a very minimally equipped CPU. I also still believe that forcing execution of a language processor (i.e., a compiler or interpreter) is a more extensive "CPU-like" task and we should still be leveraging that.

Our main problem with randomJS is that if you can randomly generate a syntactically correct program source code, you can just take the generated AST and execute it directly, instead of bothering to execute the JS.

So, instead, what if we combine these two approaches - we specify a language and a reference compiler implementation that compiles the given language into bytecode. So a valid implementation must generate identical bytecode for a given source text. And then, we feed the bytecode into randomX - which means the generated AST has nothing to do with the resulting execution, so it can't be used as a shortcut. I.e., the bytecode and the randomX instruction set are completely independent entities. This is just a long-winded way of using a compiler to generate the random bytes that randomX will execute.

FPU underutilized

Currently, the floating point scheduler is empty about 50% of the time (measured with Ryzen 1700). This means the FPU is underutilized.

Possible solutions:

  • Increasing the portion of floating point instructions.
  • Adding a simple floating point operation to all integer instructions.

Branch prediction

Currently, branches are taken/not taken randomly. This is not ideal for CPUs as they rely heavily on branch prediction.

Conditions for branching instructions should be designed such that they produce random repeating patterns.
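One way to get a repeating (and hence predictable) pattern is to derive the condition from a counter-like register instead of a random bit. A minimal sketch with illustrative constants (not the shipped design):

```cpp
#include <cstdint>

// Sketch: increment a register by a per-instruction constant and branch
// when its low bits are zero. The branch is then taken exactly once every
// 256 iterations, a pattern a CPU branch predictor learns easily.
// The increment and the 8-bit condition window are illustrative choices.
int countTaken(int iterations) {
    uint64_t reg = 0;
    const uint64_t imm = 1;        // per-instruction increment (assumed)
    const uint64_t mask = 0xFF;    // 8-bit condition window
    int taken = 0;
    for (int i = 0; i < iterations; ++i) {
        reg += imm;
        if ((reg & mask) == 0)     // periodic, hence predictable
            ++taken;
    }
    return taken;
}
```

With a 65536-iteration run the branch is taken 256 times, in a perfectly regular rhythm rather than at random.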

NUMA support

On a quad-socket server (4x Xeon E5-4640), a single process with 64 threads maxes out at about 3700 H/s.

Running 4 processes at once (one per physical CPU) with 8 threads each, I get about 2525 H/s per process, for a total of 10100 H/s.

Since real mining software will almost certainly have NUMA support, it would probably be good to implement it here so people get a more accurate idea of actual mining hashrates.
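Until proper NUMA support lands, the effect can be approximated by pinning each worker thread to a CPU on its node: under Linux's default first-touch policy, memory a pinned thread allocates is node-local. A hedged sketch (the node-to-CPU mapping, CPU 0 here, is a placeholder; real code would query libnuma or hwloc):

```cpp
#include <pthread.h>
#include <sched.h>

// Pin a worker thread to one CPU of "its" NUMA node via
// pthread_setaffinity_np (glibc/Linux). The dataset/scratchpad
// allocation is elided; only the pinning mechanics are shown.
static void* miningThread(void* arg) {
    int cpu = *static_cast<int*>(arg);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    // ... allocate the dataset/scratchpads and hash here, node-locally ...
    *static_cast<int*>(arg) = sched_getcpu();  // report where we landed
    return nullptr;
}

// spawn one pinned worker and return the CPU it actually ran on
int runPinned(int cpu) {
    pthread_t t;
    pthread_create(&t, nullptr, miningThread, &cpu);
    pthread_join(t, nullptr);
    return cpu;
}
```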

Accessing program variable in the randomx_vm class

alignas(64) randomx::Program program; is a protected member of class randomx_vm:

https://github.com/tevador/RandomX/blob/master/src/virtual_machine.hpp#L54

I don't see any way of accessing it from the derived classes. This makes it impossible to inspect the RandomX code used in generating POW in randomx_calculate_hash.

https://github.com/tevador/RandomX/blob/master/src/randomx.cpp#L236

I think such ability is important and I was wondering if the program variable could be exposed, for example by adding getProgram to class randomx_vm, in the following or a similar form:

 randomx::Program const* getProgram()
 {
     return &program;
 }

position-independent code

Modern Linux distros now compile with -fPIE enabled by default. Linking fails because of absolute references:

/usr/bin/ld: obj/JitCompilerX86-static.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIC

Adding -no-pie to CFLAGS, CCFLAGS, and LDFLAGS gets around this for now. Is that what we actually want now, since PIC is slower?

Error when compiling with clang

I tried to compile RandomX with clang by changing the CC and CXX variables in the makefile.

With clang-5, I got a compile error:
src/asm/program_prologue_linux.inc:35:28: error: cannot use base register with variable reference
movapd xmm13, xmmword ptr mantissaMask[rip]

With clang-7, some linker error:
obj/argon2_core.o: file not recognized: file format not recognized
clang-7: error: linker command failed with exit code 1 (use -v to see invocation)

gcc 7.4.0 can compile it with no problem.

Update:
I tried clang-6 and got the same linker error as clang-7.

MacOS stack size

The compiled VM will crash on MacOS when using more than 1 thread.

This is because MacOS has a reduced stack size of 512 KB for pthreads. RandomX requires a minimum of 8 MB.
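A sketch of the usual workaround: request an 8 MiB stack explicitly when spawning the hashing threads instead of relying on the platform default:

```cpp
#include <pthread.h>
#include <cstring>

// The 2 MiB local buffer stands in for the VM's deep stack usage; with
// macOS's default 512 KiB pthread stack this would overflow, but with an
// explicit 8 MiB stack (RandomX's stated minimum) it runs fine.
static void* hashThread(void* arg) {
    char buf[2u << 20];                  // stand-in for VM stack usage
    std::memset(buf, 1, sizeof buf);
    *static_cast<bool*>(arg) = (buf[0] == 1);
    return nullptr;
}

// spawn a hashing thread with an explicit 8 MiB stack; returns success
bool runWithBigStack() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 8u << 20);  // 8 MiB, per the issue
    bool ok = false;
    pthread_t t;
    if (pthread_create(&t, &attr, hashThread, &ok) != 0)
        return false;
    pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);
    return ok;
}
```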

GPU performance range evaluation

I made a quick and dirty test to see how this algo would work with an AMD GPU.
This is how I did it:

  • I created a 4 GB dataset and filled it with the CPU with random data,
  • created one 2 MB scratchpad for each GPU thread and filled them with random data on the CPU
  • divided the 32 KB of the local data store (LDS) as 2 L1 caches
  • used workgroup size 32, with only one thread working out of 16 (on AMD the wave size is 16 and each wave has its own Program Counter). With a Vega56, the card is configured to behave as 112 processor cores.
  • as an alternative, I used workgroup size of 16 with one active thread, one 16 KB L1 and had same results
  • created the RandomX sequence as a program with init, 2048 loops of 256 instructions with specified weight and L1/L2 proportions (around 50% each), the dataset mixing with cacheline prefetch, and writing registers to scratchpad.

Before giving the results, some considerations:

  • some FPU functions are wrongly implemented (like changing the rounding mode, fscale)
  • the compiler might have optimized the instruction sequence, though I took care to choose the registers randomly
  • all threads execute the same program, but as explained above each wave has its own Program Counter that can be set with an assembly instruction. GPU code can be created on the CPU,
    DMA transferred (~90 MB/s required), and then executed by flushing the GPU instruction cache and jumping. BTW, some closed source miners are using this method to run optimized kernels.
    The instruction cache is 16 KB shared between 4 CUs (32 KB on Vega?), so we have 16/4/2 -> 2 KB of instruction cache available per active thread. That will be too little to store the 1900 instructions required inside the main loop (but it will be helped by the L2 cache and the fact that 1/4th of the waves are doing something).
  • final hash not implemented, but should be very fast on a GPU.

Now the results:

  • Vega56 at 1630Mhz, about 9700 h/s
  • Vega56 at 1138Mhz, about 6800 h/s
  • Fury R9 1050MHz, 5600 h/s

I have seen those things:

  • dataset loading has no impact when properly scheduled. This is expected since GPU ram is optimized for consecutive data loading
  • the bandwidth for L2/L3 access is about 2.6 GB/s (18 L2/L3 accesses per loop x 2048 x 8 bytes x 9700 h/s). This is confirmed by the fact that h/s scales with GPU core frequency.

I checked the consistency of the number of executed GPU instructions:

  • 9700 h/s
  • times 2048
  • times around 1900 ASM instructions for 256 vm instructions (might be lower with hand made asm generation like for x86/arm)
  • times 4 clocks per cycle
  • divided by 1.6 Ghz
  • divided by 112 cores
    This gives 0.78, which sounds ok given that double-precision FP adds some clocks.

Hope I didn't make a huge mistake, but the memory bandwidth and the number of executed instructions look ok.
In a "real" miner, I only see the L1 instruction cache as a potential limiting factor. Initially I was thinking of L2/L3, but HBM is really fast. L1 instruction cache misses will be filled from the L2 cache, which is apparently not overloaded. Vega10 also has an improved instruction cache.
Needs more investigation to see how the L1 instruction cache impacts performance.

I can put the code (Vulkan) in my github if you want to have a look at this prototype.

Program overhead

Machine code of a RandomX instruction:

(machine code screenshot hosted on Imgur)

This ADD_64 instruction needs a total of 60 bytes of code, of which only one 3-byte instruction is the actual addition operation.

The overhead code consists of:

  • Checking the termination condition (bd0, bd2). This is required because control flow instructions like JUMP/CALL make the sequence of instructions unpredictable.
  • Scratchpad read address generation (bd8, bdf, bec)
  • Reading from the dataset (be2, be5, be7)
  • Scratchpad write address generation (bf8-c03)

This overhead creates a lot of unavoidable dependency chains which limit the throughput to ~2 instructions per clock.

I've been thinking about a different structure of the program, which somewhat resembles CryptonightR by @SChernykh.

The program would be a loop exactly long enough to fit into the uOP cache of new CPUs (about 1000 instructions in total). Reading from the scratchpad would be done at the beginning of the loop - all 16 registers would be loaded at once. Then only register-register instructions would be executed (long chain) and at the end, registers would be written to the scratchpad and the loop would continue from the beginning.

Since the instruction reorder buffer of x86 CPUs is >100 instructions, they could essentially eliminate all dependency chains in the loop body and execute close to 4 instructions per clock.

Initial tests show that we could generate and execute 16 chained programs with this structure and still get a performance boost over the old scheme.

Advantages:

  • higher instruction throughput (= higher latency for in-order ASIC designs)
  • lower power consumption due to the use of the uOP cache (decoders can be powered off)
  • 16 chained programs to eliminate nonce skipping

Disadvantages

  • lower number of random accesses - 8 registers are loaded at once
  • no JUMP/CALL/RET instructions are possible - this is somewhat compensated by the fact that we are generating 16 different programs

Floating point divisor precision

Currently, the divisor used in the FDIV_M instruction only has a 30-bit mantissa. This means an ASIC could use a simpler divider for this instruction than if the full 52-bit precision was used.

Solution: Fill the lowest 22 bits of the mantissa with a program-specific bit pattern.
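A sketch of the proposed fix, assuming the per-program pattern comes from some program-specific entropy (the exact source is not specified here):

```cpp
#include <cstdint>

// OR a fixed per-program pattern into the low 22 bits of the divisor's
// IEEE-754 binary64 representation, widening the effective mantissa from
// 30 bits to the full 52. "programEntropy" is an assumed input.
uint64_t widenDivisorMantissa(uint64_t divisorBits, uint64_t programEntropy) {
    const uint64_t low22 = (1ull << 22) - 1;
    const uint64_t pattern = programEntropy & low22;  // per-program constant
    return (divisorBits & ~low22) | pattern;          // fill low mantissa bits
}
```

Since the pattern is constant for the whole program, this costs the CPU nothing, but an ASIC divider now has to handle a full-width mantissa.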

AESGenerator4R behaves like the ECB mode

Issue reported in the context of Kudelski Security's audit

AESGenerator4R processes each 16-byte chunk of the generator's state independently, such that if state0 == state2 then the transformed values will be identical as well (same for 1 and 3).

The designers may want to avoid this property by using a mode similar to CTR or CBC. If CBC is used, CBC's decryption mode might be used instead of the encryption mode, because the former is parallelizable whereas the latter is not, which allows, for example, exploiting Zen's two parallel AES units.
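The symmetry, and how chaining removes it, can be illustrated with a stand-in round function F in place of real AES rounds (a toy sketch, not the generator's actual transform):

```cpp
#include <array>
#include <cstdint>

// ECB-style processing transforms equal chunks identically; a CBC-style
// chain (XOR the previous output into the next input) makes equal chunks
// diverge. The AES state is reduced to 64 bits here for brevity.
using Chunk = uint64_t;

static Chunk F(Chunk x) {                 // stand-in for one AES round
    x ^= x >> 33;
    x *= 0xFF51AFD7ED558CCDull;
    x ^= x >> 33;
    return x;
}

std::array<Chunk, 4> ecb(std::array<Chunk, 4> s) {
    for (auto& c : s) c = F(c);           // chunks processed independently
    return s;
}

std::array<Chunk, 4> cbcLike(std::array<Chunk, 4> s) {
    Chunk prev = 0;
    for (auto& c : s) { c = F(c ^ prev); prev = c; }  // chain outputs
    return s;
}
```

With state0 == state2, the ECB variant keeps the two output chunks equal, while the chained variant does not.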

License Compatibility

It would seem this code is presently licensed under GPLv3, which would require releasing any project including RandomX, such as Monero, under a GPLv3 license.

Possible ASIC design

EDIT: This design is outdated and no longer applies to the current RandomX version.

A similar idea was originally proposed by @cjdelisle for a GPU miner, but I think it's more applicable to an ASIC design.

RandomX ASIC miner:

  • ~66 MiB of SRAM on-chip
  • 4 GiB of HBM memory
  • 1 dedicated core for dataset expansion
  • 256 small decoder/scheduler cores
  • 28 worker cores

During dataset expansion, the SRAM is used to store the 64 MiB cache, while dataset blocks are loaded into HBM memory.

When mining, the ASIC runs 256 programs in parallel.
Memory allocation:

  • 256 * 256 KiB scratchpad = 64 MiB of SRAM for scratchpads
  • 256 * 8 KiB = 2 MiB of SRAM for program buffers
  • 32 KiB for register files
  • ~256 KiB for program stack (1 KiB per program) = maximum recursion depth of 64

Instructions are loaded into the decoder/scheduler core (each core has its own program buffer, program counter and register file). The scheduler cores handle only CALL and RET instructions and pass the rest to one of the 28 worker cores.

Each of the 28 worker cores implements exactly one RandomX instruction pipelined for throughput which matches the instruction weight (for example MUL_64 worker can handle 21 times more instructions per clock than DIV_64 worker). Each worker has an instruction queue fed by the scheduler cores.

The speed of the decoder/scheduler cores would be designed to keep every worker core 100% utilized.

Some complications of the design:

  • CALL/RET will stall the program if there is a dependency on an instruction which is still in a worker queue
  • FPROUND will stall all subsequent floating point instructions
  • instructions which use the same address register will create a dependency chain

There could also be some dedicated load/store ports for loading instruction operands.

If limited only by HBM bandwidth, this ASIC could do around 120 000 programs per second (480 GiB/s memory read rate), or roughly the same as 30 Ryzen 1700 CPUs. This assumes that sufficient compute power can fit on a single chip. If we estimate power consumption at 300 W (= Vega 64), this ASIC would be around 9 times more power efficient than a CPU.

Improved estimate: #11 (comment)

Disclaimer: I'm not an ASIC designer.

ERROR: Cache allocation failed (largePages)

For people looking up ERROR: Cache allocation failed whilst using --largePages:

dsc@dev6:~/zz$ ./benchmark --nonces 5000 --mine --largePages --init 12 --threads 12
RandomX benchmark
 - full memory mode (2080 MiB)
 - interpreted mode
 - hardware AES mode
 - large pages mode
Initializing (12 threads) ...
ERROR: Cache allocation failed

strace:

mmap(NULL, 268435456, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE|MAP_HUGETLB, -1, 0) = -1 ENOMEM (Cannot allocate memory)
futex(0x7f19d15d81a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(1, "ERROR: Cache allocation failed\n", 31ERROR: Cache allocation failed

MAP_HUGETLB -> allocation fails -> edit /proc/sys/vm/nr_hugepages and set it to a sufficiently high number (I set mine to 1500). This fixed it:

sudo sysctl -w vm.nr_hugepages=1500

taking advantage of modern CPU stuff

So in the effort to reduce the gap between a consumer CPU and a custom CPU, one approach is exploiting some of the stuff baked into modern CPUs. This probably just buys us time, similar to the AES-NI of cryptonight.

For instance, does randomX do things that take advantage of

CLMUL instruction set

or any of the things on the extension list?

https://en.wikipedia.org/wiki/Kaby_Lake

cmake: /wmmintrin.h:61:1: error: inlining failed in call to always_inline

The current CMakeLists.txt has some issues which make randomx difficult to use as a module in other projects.

In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/x86intrin.h:43:0,
                 from /root/RandomX/src/intrin_portable.h:89,
                 from /root/RandomX/src/soft_aes.h:32,
                 from /root/RandomX/src/aes_hash.cpp:29:
/usr/lib/gcc/x86_64-linux-gnu/5/include/wmmintrin.h: In function 'rx_vec_i128 aesenc(rx_vec_i128, rx_vec_i128) [with bool soft = false; rx_vec_i128 = __vector(2) long long int]':
/usr/lib/gcc/x86_64-linux-gnu/5/include/wmmintrin.h:61:1: error: inlining failed in call to always_inline '__m128i _mm_aesenc_si128(__m128i, __m128i)': target specific option mismatch
 _mm_aesenc_si128 (__m128i __X, __m128i __Y)
 ^
In file included from /root/RandomX/src/aes_hash.cpp:29:0:
/root/RandomX/src/soft_aes.h:40:65: error: called from here
  return soft ? soft_aesenc(in, key) : rx_aesenc_vec_i128(in, key);

The error is due to the fact that ARCH_ID is not set as this variable is specific to monero. Maybe as a fallback, if not explicitly set by the parent project it could be set to ${CMAKE_HOST_SYSTEM_PROCESSOR}?

Also, there is no cmake_minimum_required provided, which results in warnings when building randomx using cmake.

Possible updated version of the CMakeLists.txt is here:
https://github.com/moneroexamples/RandomX/blob/update_cmake/CMakeLists.txt

The cmake_minimum_required provided is the same as for monero.

With the changes, randomx can also be built using cmake:

cd RandomX
mkdir build && cd build
cmake ..
make 

Compiled VM

As a first step towards a fully compiled VM, I have added an assembly generator for RandomX programs. At the moment, it's available only for 64-bit Windows.

A sample program is located in https://github.com/tevador/RandomX/blob/master/src/program.inc

Assembly code can be generated with randomx --genAsm [nonce].

Initial tests show that the compiled VM is approximately 5-6 times faster than the interpreter (doesn't include compilation time). Compiled program size is ~26 KiB, so it fits into L1I cache.

TODO/FIX: Output from the compiled VM doesn't match the output of the reference interpreter.

Comparison of verification time (CNv0, CN-R, RandomX, RandomX-light)

Considering the ongoing discussions, I think a comparison of CNv8, CN-R, RandomX and RandomX-light block verification times could come in handy to clear up misconceptions and speed up the process. I don't know how to run CNv8 and CN-R and compare them to the RandomX verification mode test.

Lock if deemed unnecessary.

Performance and portability testing

Please post:

  1. Your CPU, RAM parameters, operating system and compiler (if you compiled yourself).
  2. Verification mode results.
  3. Mining mode results (optional).

Precompiled binaries are available for Windows 64-bit and Ubuntu/Debian x86-64. Download the latest release here: https://github.com/tevador/RandomX/releases

Verification mode

Run as:

randomx --verify --softAes
  • If your CPU is from the x86_64 family, add the --jit option to get 2-3 times faster verification.
  • If you have a recent x86 CPU which supports AES-NI, you may omit the --softAes option to get a small increase in performance.
  • If your system supports large pages, you can add --largePages option to get a small increase in performance.

Mining mode

Mining mode is currently only supported on 64-bit x86 CPUs. Requires at least 2.25 GiB of RAM.
Run as:

randomx --mine --init Q --threads T --nonces N
  • Select Q (number of initialization threads) equal to the number of hardware threads of your CPU.
  • Find T (number of mining threads) which produces the highest hashrate. Starting point should be 1 thread per 2 MB of L3 cache, but some CPUs can benefit from running more threads, while some CPUs cannot run more than 1 thread per core efficiently depending on other factors such as L1/L2 cache sizes.
  • Select N (number of nonces) equal to 10000, 100000 or 1000000 depending on the performance of your system. Aim for at least a 60-second mining period for accurate results.
  • If your CPU doesn't support AES-NI, you have to add the --softAes option. Your mining performance will be about 40% lower.
  • If your system supports large pages, you can add --largePages option to get a significant increase in performance (up to 25%).
  • If you have a NUMA system, run one instance of RandomX per NUMA node.

Single-chip ASIC design

I came across this discussion on the Grin forum: https://www.grin-forum.org/t/obelisk-grn1-chip-details/4571

The conversation contains a lot of interesting information. The main takeaway is that ASIC manufacturers prefer single-chip designs even if that may not seem like the best approach.

For RandomX, I see two possible options for what an efficient ASIC might look like:

  1. A single-chip design with lots of SRAM, similar to the GRIN ASIC. It would run in the "light" mode with the 256 MiB cache entirely on-die.
  2. A multi-chip design with HBM2 memory. It would run in the "full" mode with the 2 GiB dataset and run thousands of concurrent threads.

I did a ballpark estimate of the performance and power consumption of the first option (single-die RandomX ASIC). These are my assumptions, mainly based on the information from the Grin discussion:

  1. Largest feasible die size is about 800 mm2.
  2. The ASIC would use the TSMC 16 nm process to maximize the yield.
  3. SRAM requires about 1 mm2 of die area per MiB.
  4. Power consumption is about 0.1 W/mm2 for SRAM and 0.4 W/mm2 for cores.
  5. 0.6 mm2 per CPU-like in-order core (based on ARM A53).
  6. All RandomX instructions have a 1-cycle latency.
  7. AES round has a 1-cycle latency.
  8. Scratchpad access has a 1-cycle latency.
  9. Cache access has a 4-cycle latency.
  10. One SquareHash round (multiplication + subtraction) has a 2-cycle latency.
  11. ASIC runs at 1 GHz.

I think some of the above assumptions are rather optimistic, but I'm interested in the best possible scenario here.

First, let's estimate the time to calculate 1 hash. This requires:

  • 10 Blake2b calculations. We will disregard these.
  • 65570 sequential AES rounds (Scratchpad init+finalizer).
  • 16384 program iterations.

One program iteration consists of:

  • 256 instructions
  • 55 scratchpad accesses
  • 8 Cache accesses
  • 336 SquareHash rounds

This gives 311 cycles for the program execution and 704 cycles to calculate the dataset item, for a total of 1015 cycles per iteration, which is about 1 μs.

Total time per hash is about 16.7 ms, which means about 60 H/s per scratchpad. We can see that 2/3 of the time the core is waiting for the dataset item to be calculated, so we can have 1 core per 3 scratchpads to keep all cores busy 100% of the time.

To max-out our 800 mm2 die, we will have:

  • 256 MiB of Cache (256 mm2)
  • 83 cores (50 mm2)
  • 249 scratchpads (498 mm2)

Final performance of this chip is 249*60 = 15 KH/s.

Power consumption estimate is 754*0.1 + 50*0.4 = 95 W.

Power efficiency is about 150-160 H/J. While this is about 3x more than a CPU, I would consider this as the upper bound of ASIC efficiency given the above assumptions may not be feasible in practice.
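The arithmetic above can be re-derived in a few lines (all inputs are the figures stated in the estimate; nothing new is assumed):

```cpp
// Cycle count per hash: AES rounds for scratchpad init/finalizer, plus
// 16384 iterations of 311 execution + 704 dataset-item cycles each.
double cyclesPerHash() {
    const double aesRounds = 65570;
    const double iterations = 16384;
    const double cyclesPerIter = 311 + 704;  // = 1015
    return aesRounds + iterations * cyclesPerIter;
}

// Hashes per second per scratchpad at the assumed 1 GHz clock.
double hashesPerSecond() {
    const double clockHz = 1e9;              // assumption 11
    return clockHz / cyclesPerHash();
}
```

This yields ~16.7M cycles, i.e. ~16.7 ms per hash and ~59.9 H/s per scratchpad; times 249 scratchpads that is ~14.9 KH/s, matching the ~15 KH/s figure above.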

Doubling RANDOMX_DATASET_BASE_SIZE to 4294967296 causes hash outputs to differ depending on what flags were set.

Steps to reproduce:

  • Build RandomX with a RANDOMX_DATASET_BASE_SIZE of 4294967296.
  • Build the following program:
#include "src/randomx.h"
#include <iostream>
#include <iomanip>

int main() {
	const char myKey[] = "RandomX example key";
	const char myInput[] = "RandomX example input";
	char hash1[RANDOMX_HASH_SIZE];
	char hash2[RANDOMX_HASH_SIZE];

	randomx_cache *myCache1 = randomx_alloc_cache((randomx_flags) (RANDOMX_FLAG_JIT));
	if (myCache1 == nullptr) {
		std::cout << "Cache allocation failed" << std::endl;
		return 1;
	}
	randomx_init_cache(myCache1, myKey, sizeof myKey);

	randomx_vm *myMachine1 = randomx_create_vm((randomx_flags)(RANDOMX_FLAG_JIT), myCache1, nullptr);
	if (myMachine1 == nullptr) {
		std::cout << "Failed to create a virtual machine" << std::endl;
		return 1;
	}

	randomx_calculate_hash(myMachine1, &myInput, sizeof myInput, hash1);

	randomx_destroy_vm(myMachine1);
	randomx_release_cache(myCache1);

	randomx_cache *myCache2 = randomx_alloc_cache((randomx_flags) (RANDOMX_FLAG_DEFAULT));
	if (myCache2 == nullptr) {
		std::cout << "Cache allocation failed" << std::endl;
		return 1;
	}
	randomx_init_cache(myCache2, myKey, sizeof myKey);

	randomx_vm *myMachine2 = randomx_create_vm((randomx_flags)(RANDOMX_FLAG_DEFAULT), myCache2, nullptr);
	if (myMachine2 == nullptr) {
		std::cout << "Failed to create a virtual machine" << std::endl;
		return 1;
	}

	randomx_calculate_hash(myMachine2, &myInput, sizeof myInput, hash2);

	randomx_destroy_vm(myMachine2);
	randomx_release_cache(myCache2);

	for (unsigned i = 0; i < RANDOMX_HASH_SIZE; ++i)
		std::cout << std::setfill('0') << std::setw(2) << std::hex << ((int) hash1[i] & 0xff);

	std::cout << std::endl;

	for (unsigned i = 0; i < RANDOMX_HASH_SIZE; ++i)
		std::cout << std::setfill('0') << std::setw(2) << std::hex << ((int) hash2[i] & 0xff);

	std::cout << std::endl;

	return 0;
}
  • Run the built program.

I've noticed this with more than just the JIT flag, and I believe all flag combinations exhibit this problem.

Unoptimized BLAKE2b

Issue reported in the context of Kudelski Security's audit

The implementation does not leverage vectorized instructions. For example, on platforms supporting AVX2, a portable reference implementation is about 40% slower than an AVX2 implementation, as reported in a Cannonlake microarchitecture benchmark from SUPERCOP.

An AVX2 implementation of BLAKE2b can be found in the SUPERCOP archive as well as in Libsodium.
An AVX512-optimized version of BLAKE2s (not BLAKE2b) is used in Wireguard.
Similar techniques may be used to optimize BLAKE2b for the AVX512 instruction set.

"Operations that preserve entropy" 🤔

Issue reported in the context of Kudelski Security's audit

The design document states:

RandomX uses all primitive integer operations that preserve entropy: addition (IADD_RS, IADD_M), subtraction (ISUB_R, ISUB_M, INEG_R), multiplication (IMUL_R, IMUL_M, IMULH_R, IMULH_M, ISMULH_R, ISMULH_M, IMUL_RCP), exclusive or (IXOR_R, IXOR_M) and rotation (IROR_R, IROL_R).

Here "preserve entropy" is probably meant as "is invertible if one of the operands is fixed". However, this is not the case for multiplication. I suggest clarifying that part.

CBRANCH register selection favors non-speculative execution

The condition register of the CBRANCH instruction is selected to be the least recently modified register. This maximizes the length of the jump, but has an unwanted side effect by allowing the condition to be evaluated in advance.

I made some simulations to estimate the benefits of speculative execution in RandomX and it turns out that in a large number of cases, speculation is not necessary because the condition result is already known by the time execution arrives at the branch.

Example program: https://pastebin.com/hdSJyuUF
The number in front of each instruction is the cycle when the result of the instruction is known. Execution was simulated as a non-speculative out-of-order ASIC with a 3-cycle pipeline and 4 execution ports (notice there is at least a 3-cycle delay between each CBRANCH instruction and the next instruction).

Here the non-speculative ASIC needed 71 cycles to execute this program. A speculative ASIC would reduce this to 64 cycles.

I propose to remove the complex register selection algorithm in the CBRANCH instruction and instead simply use the existing dst bit field to select the condition register. The branch target will still be the instruction following the one where the condition register was last modified.
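A sketch of the proposed selection, with an illustrative instruction layout (not the real decoder):

```cpp
#include <cstdint>

// Hypothetical instruction encoding for illustration only: the point is
// that the condition register comes straight from the dst bit field,
// replacing the scan for the least recently modified register.
struct Instruction {
    uint8_t opcode, dst, src;
    uint32_t imm32;
};

constexpr int IntRegisterCount = 8;

int conditionRegister(const Instruction& instr) {
    return instr.dst % IntRegisterCount;  // proposed: reuse dst bits
}
```

Because dst is effectively random, the condition register is often one that was modified just before the branch, so its value is not known early and speculation becomes necessary.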

This change increases the performance gap between a speculative and non-speculative ASIC:

                    current       proposed
non-speculative     76.3 cycles   84.4 cycles
speculative         64.7 cycles   64.8 cycles
spec-ex advantage   18%           30%

(table lists the average execution time of 10000 random programs)

The rationale for this change is that non-speculative ASIC designs are more likely due to their simplicity and lower power consumption. Most high-performance CPUs utilize speculative execution, so this change would increase the ASIC resistance of RandomX.

Side effects of this change:

  • No measurable impact on CPU performance
  • The average combined length of all jumps in a program is decreased from 215 to 122. This is still high enough to prevent predicated execution.

Can't start benchmark

At first I tried to manually compile it using the code below:

sudo apt-get update
sudo apt-get install build-essential
git clone https://github.com/tevador/RandomX.git
cd RandomX
cmake .
make
cd bin
./benchmark

This has resulted in

ERROR: Dataset allocation failed
Then trying it a second time resulted in
ERROR: Cache allocation failed

So I tried to install the latest precompiled release which gave me this

./benchmark: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by ./benchmark)

And I fixed that using sudo apt-get install --only-upgrade libstdc++6 Then it started giving me

ERROR: Dataset allocation failed

again.

Not sure what's going on here.

Need ARMv8 support

Are you planning to write the ARM support or do we need someone to do this?
It's not clear to me how much needs to be ported at the moment.

RandomX Emergency Failsafe: Force ASICs to be useful with Random Lossless Data Compression Algorithms

Hi @tevador, in the event RandomX fails to deter ASICs, I've got an idea to extend the RandomX lifespan even further.

Emergency Failsafe / Backup proposal:

  • Integrate "Random Lossless Data Compression Algorithms" into the RandomX Program sequence, forcing the creation of extremely useful and commoditized 'Lossless Data Compression ASICs'.

How This Might Work:

  • If RandomX is declared as "Failed", and the devs choose to go ASIC friendly,
  • Lossless Data compression algorithms can be integrated into the RandomX VM program sequence (various Lossless Data Compression Algorithms to choose from).
  • That will force ASIC manufacturers to make a lossless data compression ASIC that can work with random data (coming from RandomX's VM randomization engine)
  • Because these ASICs can handle random data, data can be re-routed through the ASIC for on-the-fly data compression in real-world applications outside mining.
  • ASICs can't cheat, as the dataset is randomly generated by RandomX's engine.

There would be Huge Benefits for the Entire computing industry

  • If ASICs can do "Random Lossless data compression" on the fly, it would be "Incredibly useful" for the entire computer industry.
  • Lossless data compression ASICs would be decentralized and heavily commoditized, built into every computer and phone on the planet: SSDs, HDDs, phones, servers, data centres, TVs and Internet routers could compress data on the fly in real time, reducing Internet bandwidth, YouTube video downloads, file sharing costs and SSD/HDD storage costs.

I'm thinking this would help the world far more than SHA3.

Could this be done with RandomX's Randomization VM engine?

Thanks,

Design considerations to make potential ASICs more general purpose

The most common critique of any effort to create an ASIC-proof PoW is that any PoW can be made into an ASIC. The general thought has been to design the PoW such that the eventual ASIC is no more cost-efficient than consumer-grade hardware.

What if instead we consider design choices that would make a potential ASIC useful for other applications? This would affect the real-world side of the ASIC problem: manufacturing centralization and distribution.

Basically, a CPU is going to run RandomX at 100%. A RandomX ASIC could possibly run RandomX at 200%. But what if a randomX co-processor could be integrated with commodity CPUs to make the final consumer product (some next gen CPU or motherboard) perform consumer-grade tasks at 170% of a standard CPU?

It's effectively the same model as integrated graphics.

So instead of an ASIC manufacturer only having the option of making RandomX ASICs to mine Monero, the same circuitry (or IP) could be packaged (or sold) for other purposes. This would get us to commoditization faster, because which is more profitable if you stumble across a genius design to hack RandomX: printing a bunch of hardware and setting up mines, or just selling the IP to Intel or AMD? From what I know, these companies are struggling to find ways to make new CPUs better and faster.

For instance, this type of design choice may re-open #2, because any advances in branch prediction would provide a boon for any situation where a CPU is tasked with running an operating system, or anything else (if I understand things correctly).

ARM Rounding

I've been doing some portability tests and found out that the ARM instruction for conversion between integer and floating point (VCVT) uses a hardcoded rounding mode rather than the rounding mode set in the FPSCR register. Reference on page 4-22.

This causes a discrepancy between the results from x86 and ARM (confirmed with Raspberry Pi 3).

Possible solutions:

  1. Keep the current integer to float conversion. This would force ARM to fix the rounding in software. Possible high impact on performance for ARM.
  2. Perform the conversion always in the default rounding mode. This would force x86 to switch rounding modes between conversion and the actual operation (addition, subtraction etc.). Possible high impact on performance for x86.
  3. Remove the FROUND instruction altogether and only use the default rounding mode round-to-nearest. This would have no impact on performance but would decrease ASIC resistance.
  4. Remove the conversion step from FPU instructions and treat the arguments directly as floating point values. Possible issues with NaN, infinity and denormal values. Unknown impact on performance and randomness of the results.

Not sure which solution is the best. Thoughts?
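For illustration, here is a small hedged sketch (the helper name is mine, not from the codebase) of why option 2 is costly on x86: the x86 integer-to-double conversion honors the dynamic rounding mode in MXCSR, so converting in a non-default mode means switching the mode around every conversion, while ARMv7's VCVT would ignore the switch entirely.

```cpp
#include <cfenv>
#include <cstdint>

#pragma STDC FENV_ACCESS ON

// Convert v to double under the given rounding mode. On x86, CVTSI2SD
// honors the MXCSR rounding control set by fesetround(); on ARMv7, the
// hardware VCVT instruction ignores FPSCR, which is the discrepancy
// described above.
double convert_rounded(int64_t v, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    volatile double d = (double)v; // volatile: force a runtime conversion
    std::fesetround(saved);
    return d;
}
```

On x86, `convert_rounded((1LL << 53) + 1, FE_DOWNWARD)` should give 2^53 and `FE_UPWARD` should give 2^53 + 2, because 2^53 + 1 is not representable in a double.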

Compilation failure on MacOS Mojave

MacOS Mojave, Apple LLVM version 10.0.0

Compilation error on both master@6e8c83fd and dev@6b344b81:

$ make
cc -march=native -O3 -flto -c src/argon2_core.c -o obj/argon2_core.o
cc -march=native -O3 -flto -c src/argon2_ref.c -o obj/argon2_ref.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/AssemblyGeneratorX86.cpp -o obj/AssemblyGeneratorX86.o
cc -march=native -O3 -flto -c src/blake2/blake2b.c -o obj/blake2b.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/CompiledVirtualMachine.cpp -o obj/CompiledVirtualMachine.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/dataset.cpp -o obj/dataset.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/JitCompilerX86.cpp -o obj/JitCompilerX86.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/instructionsPortable.cpp -o obj/instructionsPortable.o
src/instructionsPortable.cpp:22:26: warning: expected 'ON' or 'OFF' or 'DEFAULT' in pragma [-Wunknown-pragmas]
#pragma STDC FENV_ACCESS on
                         ^
1 warning generated.
c++ -std=c++11 -maes -march=native -O3 -flto -c src/Instruction.cpp -o obj/Instruction.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/InterpretedVirtualMachine.cpp -o obj/InterpretedVirtualMachine.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/main.cpp -o obj/main.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/Program.cpp -o obj/Program.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/softAes.cpp -o obj/softAes.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/VirtualMachine.cpp -o obj/VirtualMachine.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/Cache.cpp -o obj/Cache.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/virtualMemory.cpp -o obj/virtualMemory.o
cc -march=native -O3 -flto -c src/reciprocal.c -o obj/reciprocal.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/LightClientAsyncWorker.cpp -o obj/LightClientAsyncWorker.o
c++ -std=c++11 -maes -march=native -O3 -flto -c src/hashAes1Rx4.cpp -o obj/hashAes1Rx4.o
c++ -x assembler-with-cpp -c src/JitCompilerX86-static.S -o obj/JitCompilerX86-static.o
src/JitCompilerX86-static.S:41:8: error: invalid alignment value
.align 64
       ^
src/JitCompilerX86-static.S:45:8: error: invalid alignment value
.align 64
       ^
src/JitCompilerX86-static.S:48:8: error: invalid alignment value
.align 64
       ^
src/JitCompilerX86-static.S:67:8: error: invalid alignment value
.align 64
       ^
src/JitCompilerX86-static.S:71:8: error: invalid alignment value
.align 64
       ^
make: *** [obj/JitCompilerX86-static.o] Error 1
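A likely fix (untested here): on Mach-O targets, clang's assembler interprets the operand of `.align` as a power-of-two exponent, so `.align 64` requests 2^64-byte alignment and is rejected. `.balign` takes a byte count on both ELF and Mach-O:

```asm
/* .align is a byte count on ELF but a power-of-two exponent on Mach-O,
   where ".align 64" means 2^64 bytes and fails to assemble.
   .balign takes a byte count everywhere, so this aligns to 64 bytes
   on both platforms: */
.balign 64
```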

Instruction weights

The provisional instruction weights are following:

#define WT_ADD_64 10
#define WT_ADD_32 2
#define WT_SUB_64 10
#define WT_SUB_32 2
#define WT_MUL_64 21
#define WT_MULH_64 10
#define WT_MUL_32 15
#define WT_IMUL_32 15
#define WT_IMULH_64 10
#define WT_DIV_64 1
#define WT_IDIV_64 1
#define WT_AND_64 4
#define WT_AND_32 2
#define WT_OR_64 4
#define WT_OR_32 2
#define WT_XOR_64 4
#define WT_XOR_32 2
#define WT_SHL_64 3
#define WT_SHR_64 3
#define WT_SAR_64 3
#define WT_ROL_64 6
#define WT_ROR_64 6
#define WT_FPADD 20
#define WT_FPSUB 20
#define WT_FPMUL 22
#define WT_FPDIV 8
#define WT_FPSQRT 6
#define WT_FPROUND 2
#define WT_CALL 24
#define WT_RET 18

(Divide weight by 256 to get the probability that the instruction is generated.)

The rationale behind these values:

  • Minimum weight of an instruction is 2 to decrease the chance of finding a program without the instruction (DIV_64 and IDIV_64 together have weight 2). The chance is ~1.8% with a weight of 2 and ~13.5% with a weight of 1.
  • Weights of 32-bit instructions (ADD_32, SUB_32, ...) are minimized to limit the number of zeroes in the results.
  • About every 6th instruction is a branch (CALL/RET).
  • CALL has a higher weight than RET to limit the number of RET instructions executed with an empty stack (currently the chance is about 7%).
  • Highest weight is given to fast complex instructions: multiplication and floating point.
  • MULH instructions have a slightly lower weight because they produce only a 32-bit result when one of the inputs is 32-bit and zero when both inputs are 32-bit.
  • High-latency instructions (DIV_64, IDIV_64, FPDIV, FPSQRT) have lower weights.
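The probabilities quoted in the first bullet can be checked directly; the sketch below assumes 512 instructions per program (the helper is mine, not from the codebase):

```cpp
#include <cmath>

// Probability that a randomly generated program of progSize instructions
// contains no instruction of a given opcode, when that opcode has weight
// `weight` out of 256. With weight 2 this is ~1.8% and with weight 1
// ~13.5%, matching the figures above (assuming 512 instructions per
// program).
double chanceOfMissing(int weight, int progSize = 512) {
    return std::pow((256.0 - weight) / 256.0, progSize);
}
```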

Potential issues:

  • FPADD and FPSUB become a no-op when one of the operands is much smaller in magnitude than the other (the small addend is absorbed).
  • Some values on the stack will never be used because RET is less common than CALL.

Does anyone have ideas for a better instruction mix to maximize ASIC resistance?

Performance testing

I have published a RandomX x86 benchmark. It tests all ALU and FPU VM instructions.

Download the release here: https://github.com/tevador/RandomX/releases (precompiled binaries available for Windows x64 and Linux x64).

Paste the result directly into this thread together with your CPU specs and platform.

Intel Core i5-3230M (Ivy Bridge, 2013), Windows 7 x64

operation time [s] (result)
NOOP 0.65806 13901818124319359487
ADD_64 0.7371 17981554887921175522
ADD_32 0.740774 4040882173786131426
SUB_64 0.782932 2297734475728124768
SUB_32 0.749162 4040882150066715488
MUL_64 0.82144 13108172643246038454
MULH_64 0.974916 4528287945977545198
MUL_32 0.832865 3505774372494007734
IMUL_32 0.832609 111758659619151286
IMULH_64 0.978652 14335428980581912669
DIV_64 9.17839 4040949655748476752
IDIV_64 9.79582 10706389806209703347
AND_64 0.737244 17685105886084186988
AND_32 0.752489 4040880554977510252
OR_64 0.740858 3708075074809940201
OR_32 0.754382 4040882048943442153
XOR_64 0.737324 11373760495026296796
XOR_32 0.751642 4040882115245103068
SHL_64 0.977999 13807543856509042205
SHR_64 1.00519 18303742120199201659
SAR_64 1.00618 13476444829495768773
ROR_64 0.966894 10563782527368126285
ROL_64 0.966994 15157809728722651568
FNOOP 1.1203 11251318250348714849
FADD 1.2373 14977738299044271065
FSUB 1.23676 16485221503511343178
FMUL 1.23883 5358969898691635935
FDIV 4.37206 4466393864116852366
FSQRT 4.33514 3437575554930583295
FROUND 2.01637 11251318250348714849

Intel Core i3-3220 (Ivy Bridge, 2012), Ubuntu 16.04 x64

operation time [s] (result)
NOOP 0.516929 13901818124319359487
ADD_64 0.638101 17981554887921175522
ADD_32 0.745153 4040882173786131426
SUB_64 0.592502 2297734475728124768
SUB_32 0.590001 4040882150066715488
MUL_64 0.638484 13108172643246038454
MULH_64 0.851473 4528287945977545198
MUL_32 0.607774 3505774372494007734
IMUL_32 0.607894 111758659619151286
IMULH_64 0.851494 14335428980581912669
DIV_64 6.97456 4040949655748476752
IDIV_64 7.7931 10706389806209703347
AND_64 0.638668 17685105886084186988
AND_32 0.744427 4040880554977510252
OR_64 0.638064 3708075074809940201
OR_32 0.744875 4040882048943442153
XOR_64 0.638116 11373760495026296796
XOR_32 0.744428 4040882115245103068
SHL_64 0.738194 13807543856509042205
SHR_64 0.729493 18303742120199201659
SAR_64 0.729931 13476444829495768773
ROR_64 0.729517 10563782527368126285
ROL_64 0.729536 15157809728722651568
FNOOP 0.733196 11251318250348714849
FADD 0.76531 14977738299044271065
FSUB 0.765336 16485221503511343178
FMUL 0.804609 5358969898691635935
FDIV 3.43021 4466393864116852366
FSQRT 3.41546 3437575554930583295
FROUND 1.54395 11251318250348714849

AMD Sempron 2800+ (Palermo, 2004), Ubuntu 18.04 x64 (One of the oldest CPUs you can run this benchmark on)

operation time [s] (result)
NOOP 1.37092 13901818124319359487
ADD_64 1.62194 17981554887921175522
ADD_32 1.79334 4040882173786131426
SUB_64 1.6724 2297734475728124768
SUB_32 1.4496 4040882150066715488
MUL_64 1.74611 13108172643246038454
MULH_64 2.78505 4528287945977545198
MUL_32 1.97347 3505774372494007734
IMUL_32 1.90647 111758659619151286
IMULH_64 2.48499 14335428980581912669
DIV_64 36.962 4040949655748476752
IDIV_64 37.9536 10706389806209703347
AND_64 1.84943 17685105886084186988
AND_32 1.74499 4040880554977510252
OR_64 1.62137 3708075074809940201
OR_32 1.74484 4040882048943442153
XOR_64 1.61982 11373760495026296796
XOR_32 2.20235 4040882115245103068
SHL_64 1.62928 13807543856509042205
SHR_64 1.63084 18303742120199201659
SAR_64 1.62931 13476444829495768773
ROR_64 1.63083 10563782527368126285
ROL_64 1.62921 15157809728722651568
FNOOP 2.39553 11251318250348714849
FADD 2.79056 14977738299044271065
FSUB 2.64978 16485221503511343178
FMUL 2.65083 5358969898691635935
FDIV 8.97491 4466393864116852366
FSQRT 15.2086 3437575554930583295
FROUND 7.25179 11251318250348714849

Portability and correctness

I have pushed an interpreter for RandomX.

The aim of the interpreter is portability and correctness, not speed.

Tested on the following platforms: Windows (x86_64), Windows (x86), Linux (x86_64), ARM (armv7+neon).

Run the test as ./randomx [--lightClient] [--softAes] [numberOfPrograms=1000]

--lightClient is for a "light client" mode, which requires only ~64 MiB of memory (full client requires >4 GiB). --softAes uses software AES.

For 1000 programs (default), the output hash must be 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e (applies to commit c9102ee).

There is also a unit test just for ALU and FPU operations: bin/AluFpuTest.

Windows 64-bit (Core i5-3230M)

> randomx.exe
Initializing...
Dataset (4 GiB) initialized in 25.4644 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 45.9957 programs per second
> randomx.exe --softAes
Using software AES.
Initializing...
Dataset (4 GiB) initialized in 53.0843 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 45.9795 programs per second
> randomx.exe --lightClient
Initializing...
Cache (64 MiB) initialized in 1.0757 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 16.0407 programs per second
> randomx.exe --lightClient --softAes
Using software AES.
Initializing...
Cache (64 MiB) initialized in 1.01232 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 9.446 programs per second

Windows 32-bit (Core i5-3230M)

> randomx.exe --lightClient
Initializing...
Cache (64 MiB) initialized in 2.77849 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 9.43859 programs per second
> randomx.exe --lightClient --softAes
Using software AES.
Initializing...
Cache (64 MiB) initialized in 2.78784 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 6.89343 programs per second

Ubuntu x86_64 (Ryzen 7 1700)

$ bin/randomx
Initializing...
Dataset (4 GiB) initialized in 19.8195 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 58.5224 programs per second
$ bin/randomx --softAes
Using software AES.
Initializing...
Dataset (4 GiB) initialized in 39.8167 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 58.5863 programs per second
$ bin/randomx --lightClient
Initializing...
Cache (64 MiB) initialized in 0.610684 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 21.1308 programs per second
$ bin/randomx --lightClient --softAes
Using software AES.
Initializing...
Cache (64 MiB) initialized in 0.607081 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 12.4759 programs per second

ARMv7 + NEON (Raspberry Pi 3)

$ bin/randomx --lightClient --softAes
Using software AES.
Initializing...
Cache (64 MiB) initialized in 8.2832 s
Running benchmark (1000 programs) ...
Cumulative output hash: 74511dda1088b3e74f7bd91b6e35f4d39ad73ffaa40619cec0d896abadbd6e1e
Performance: 1.32003 programs per second

Possibility of increasing RAM requirement to 6 GB?

Because of the 4 GB requirement, both webmining and botnets are hindered. Botnets either don't have the necessary RAM, or mining will be too disruptive to the machine's performance and thus harder to keep undetected.

How much do we know about the hardware capabilities of existing botnets? What systems tend to get infected? A lot of computers out there have 4 GB of RAM or less, and that's before the OS or any apps take their share, so they will be hosed by the 4 GB requirement. What kind of apps do 8 and 16 GB machines generally run, and how disruptive is 4 GB of extra use, especially to 8 GB machines?

Will a purpose-built CPU miner want dual-channel RAM? With good timings?
You'd go 2x4 GB. Windows 10 with nothing else running should not use more than 1.5 GB (with free RAM, Windows 10 will allocate it even when not strictly needed); maybe you can even strip it down or go with some kind of Linux solution (not pirating Windows is an unnecessary cost, and you shouldn't pirate). Would 6 GB + 1.5 GB + 500 MB free be cutting it too close?

Is the 4 GB also taking into consideration people mining with their PC? 4 GB machines are screwed anyway and 6 GB is pretty much nonexistent. Looking at the Steam hardware survey: https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam

8GB 37.90%
16+GB ~35%

Is mining with an 8 GB machine while it is not idle even viable with 4 GB of miner RAM usage? If you have spare RAM, how many CPU cores can you reasonably spare? Pretty much all of the 4-core processors have no HT; it used to be an i7 K-series exclusive feature (now only i9, with more cores).

The idea would be to not make it more expensive to build a bare-bones miner (8 GB is needed anyway, so why waste RAM?) while possibly making it harder for infected machines to mine and stay undetected.

Do we know what kind of performance impact dual- vs single-channel RAM has, and also different frequencies and timings? Can we make it favor faster RAM?

Unique scratchpad addresses

Since there are only 8 integer registers, there is a good chance that two memory operations will end up with the same address. This happens in almost every program. For example, in the sample program, instructions 108 and 111 read exactly the same scratchpad location. The same thing happens with store operations, in which case the first ISTORE instruction can be optimized away.

Proposed solution: use imm32 as an instruction-specific displacement. The address calculation will change from:

mov eax, r8d
and eax, 16376

to (for example)

lea eax, [r8+1860630909]
and eax, 16376

This adds one ALU operation per memory instruction, which should have only a small impact on CPU performance.
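In C terms, the proposed calculation is just an add before the mask; the sketch below is illustrative (names are mine), using the 16376 L1 mask from the assembly above:

```cpp
#include <cstdint>

// Sketch of the proposed address calculation. Adding an
// instruction-specific 32-bit displacement before masking makes two
// instructions with the same source register read different scratchpad
// locations. The mask keeps the address 8-byte aligned and in range.
constexpr uint32_t ScratchpadL1Mask = 16376;

uint32_t scratchpadAddress(uint32_t reg, uint32_t imm32) {
    return (reg + imm32) & ScratchpadL1Mask;
}
```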

AESHash1R is not a secure hash function (and that's fine?)

Issue reported in the context of Kudelski Security's audit

Collisions and preimages for AESHash1R can be found instantaneously, since the message is input as 16-byte blocks xored once with part of the state. (The attack is straightforward given the descriptions of AESHash1R.)

This cryptographic insecurity is likely understood by the authors; we nonetheless believe that AESHash1R should be a secure hash function, for otherwise one can easily determine Scratchpad values that map to a given hash. It may not be exploitable, but why take the risk?

Furthermore, in our understanding this part of the PoW is not performance-critical, so the BLAKE2b-based hash may be used directly instead.

FP register loading

Memory operands are loaded as 8-byte values from the address indicated by src. The 8 byte value is interpreted as two 32-bit signed integers and implicitly converted to floating point format

The 64-bit FP format has 52 bits of mantissa, so the low 20 bits of all FP registers will be 0 right after loading. An ASIC can have optimizations for this, especially for read-only FP registers. We need to load 64-bit signed integers and convert them to double precision FP with the current rounding mode to fill all bits of the FP registers.
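A quick way to confirm the claim (illustrative helpers, not from the codebase): any 32-bit integer needs at most 31 significand bits, so after conversion to double at least the low 21 of the 52 mantissa bits are zero, while a 64-bit source can fill the whole mantissa.

```cpp
#include <cstdint>
#include <cstring>

// Raw IEEE-754 bit pattern of d (helper for illustration).
uint64_t bitsOf(double d) {
    uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

// A 32-bit integer fits in at most 31 significand bits, so after
// conversion the low 52 - 31 = 21 mantissa bits are always zero (the
// issue above talks about the low 20). A 64-bit source such as 2^53 - 1
// can populate the full mantissa.
bool lowMantissaBitsZero(int32_t x) {
    return (bitsOf((double)x) & 0xFFFFFu) == 0; // low 20 bits
}
```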

largePages issue on Dell R820 4x E5-4640 Xeon / ubuntu v16.04

Hello,
Any time I set the --largePages option, I get an error. Am I making a mistake? Thanks for your time and effort.
Command used to set hugepages: sysctl -w vm.nr_hugepages=128

root@DELL-R820-E5-03:/randomx/RandomX/bin# gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.1.0-10ubuntu1~16.04.york0' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.1.0 (Ubuntu 7.1.0-10ubuntu1~16.04.york0)

root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --largePages
RandomX benchmark

  • full memory mode (2080 MiB)
  • interpreted mode
  • hardware AES mode
  • large pages mode
    Initializing (1 thread) ...
    ERROR: Dataset allocation failed

root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --jit
RandomX benchmark

  • full memory mode (2080 MiB)
  • JIT compiled mode
  • hardware AES mode
  • small pages mode
    Initializing (1 thread) ...
    Memory initialized in 28.6958 s
    Initializing 1 virtual machine(s) ...
    Running benchmark (1000 nonces) ...
    Calculated result: 57eb2044a20f8d6d6ef75eba6d879af5b9850556cb5a7cb459d7b091bd8d63aa
    Reference result: 57eb2044a20f8d6d6ef75eba6d879af5b9850556cb5a7cb459d7b091bd8d63aa
    Performance: 396.985 hashes per second

root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --jit --threads 40 --init 64
RandomX benchmark

  • full memory mode (2080 MiB)
  • JIT compiled mode
  • hardware AES mode
  • small pages mode
    Initializing (64 threads) ...
    Memory initialized in 3.993 s
    Initializing 40 virtual machine(s) ...
    Running benchmark (1000 nonces) ...
    Calculated result: 57eb2044a20f8d6d6ef75eba6d879af5b9850556cb5a7cb459d7b091bd8d63aa
    Reference result: 57eb2044a20f8d6d6ef75eba6d879af5b9850556cb5a7cb459d7b091bd8d63aa
    Performance: 2440.3 hashes per second

root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --jit --threads 40 --init 64 --nonces 1000000
RandomX benchmark

  • full memory mode (2080 MiB)
  • JIT compiled mode
  • hardware AES mode
  • small pages mode
    Initializing (64 threads) ...
    Memory initialized in 4.16322 s
    Initializing 40 virtual machine(s) ...
    Running benchmark (1000000 nonces) ...
    Calculated result: 70236ad5491adcfbb5d3136601b8dcf8859ab7f061c35b5bff58e7b04da32de0
    Performance: 6118.03 hashes per second

root@DELL-R820-E5-03:~/randomx/RandomX/bin# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz
Stepping: 7
CPU MHz: 1388.062
CPU max MHz: 2800.0000
CPU min MHz: 1200.0000
BogoMIPS: 4803.03
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60
NUMA node1 CPU(s): 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
NUMA node2 CPU(s): 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62
NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts

root@DELL-R820-E5-03:~/randomx/RandomX/bin# ./benchmark --mine --jit --largePages --threads 40 --init 64 --nonces 1000000
RandomX benchmark

  • full memory mode (2080 MiB)
  • JIT compiled mode
  • hardware AES mode
  • large pages mode
    Initializing (64 threads) ...
    ERROR: Dataset allocation failed

root@DELL-R820-E5-03:~/randomx/RandomX/bin# uname -a
Linux DELL-R820-E5-03 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

root@DELL-R820-E5-03:~/randomx/RandomX/bin# free -m
total used free shared buff/cache available
Mem: 15971 444 14863 9 663 15006
Swap: 16318 0 16318

root@DELL-R820-E5-03:/randomx/RandomX/bin# cat /proc/meminfo
MemTotal: 16354676 kB
MemFree: 15219988 kB
MemAvailable: 15367052 kB
Buffers: 36828 kB
Cached: 481796 kB
SwapCached: 0 kB
Active: 417548 kB
Inactive: 140912 kB
Active(anon): 42720 kB
Inactive(anon): 9168 kB
Active(file): 374828 kB
Inactive(file): 131744 kB
Unevictable: 3652 kB
Mlocked: 3652 kB
SwapTotal: 16710652 kB
SwapFree: 16710652 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 43520 kB
Mapped: 33532 kB
Shmem: 9632 kB
Slab: 160804 kB
SReclaimable: 62612 kB
SUnreclaim: 98192 kB
KernelStack: 10704 kB
PageTables: 2940 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 24756916 kB
Committed_AS: 481120 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 14336 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 128
HugePages_Free: 128
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 134252 kB
DirectMap2M: 6092800 kB
DirectMap1G: 12582912 kB
root@DELL-R820-E5-03:/randomx/RandomX/bin#
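For what it's worth, the failure looks consistent with the reserved pool being too small: the meminfo output shows HugePages_Total: 128, i.e. 128 × 2 MiB = 256 MiB, while the 2080 MiB dataset alone needs 1040 huge pages. A quick check (the helper is illustrative):

```cpp
#include <cstdint>

// Illustrative helper: number of huge pages needed for an allocation.
// The dataset above is 2080 MiB; with the default 2 MiB huge pages that
// is 1040 pages, but only 128 were reserved (HugePages_Total: 128),
// which would explain "Dataset allocation failed".
uint64_t hugePagesNeeded(uint64_t bytes, uint64_t pageSize = 2 * 1024 * 1024) {
    return (bytes + pageSize - 1) / pageSize; // round up
}
```

hugePagesNeeded(2080 MiB) is 1040, so vm.nr_hugepages likely needs to be at least 1040 (plus some headroom for the per-VM scratchpads) before --largePages can succeed.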

Wrong dataset size in the documentation

Some parts of the documentation (README.md, design.md, dataset.md, vm.md) and the code (main.cpp, dataset.cpp) indicate that the dataset has a size of 4 GiB, but if I'm not mistaken it is now 2 GiB.

FPADD and FPSUB optimization

Floating point instructions FPADD and FPSUB become a no-op if the second operand is larger than about 1e+25 (the mantissa is not big enough to handle adding a small and a very big number).

This happens in about 22% of cases, and in about 7% of cases both vector values (the whole register) are unchanged. An ASIC could mark registers with large values and optimize out FPADD and FPSUB instructions that use such a register.

Possible solution: separate source and destination registers for FPADD/FPSUB and FPMUL/FPDIV. Large values will be possible only in the second group of registers, so all floating point operations will have an effect.
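The absorption effect is easy to confirm (illustrative helper, not from the codebase): when the magnitudes differ by more than the 53-bit significand can bridge, addition and subtraction return the larger operand unchanged.

```cpp
// When the magnitudes differ by more than ~2^53, the smaller addend is
// absorbed and FPADD/FPSUB leave the register unchanged -- exactly the
// case an ASIC could detect and skip.
bool additionIsNoop(double big, double small) {
    return (big + small) == big && (big - small) == big;
}
```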

Big endian CPU support

Will you finish fixing the big endian problems?
I want to start porting the JIT to ppc64le, and ppc64be too if you fix it, so I'd like to know if RandomX will stay little endian only.

Optimized integer division

Currently, an average program contains 4 integer division instructions (out of 512).

I'm planning to increase it to 32 division instructions. 28 would be divisions by a runtime constant and 4 divisions by a runtime variable.

Integer division by a runtime constant can be optimized into a multiplication by a fixed-point reciprocal.

I have implemented this optimization in the JIT compiler and there is no measurable performance drop with 32 divisions per program. Without this optimization, performance drops by around 10% (Intel Sandy Bridge).

This attempts to force an ASIC to either perform some sort of compilation step or face 8 times more division instructions.
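For readers unfamiliar with the trick, here is an illustrative sketch of the strength reduction (this is not the exact fixed-point format randomx_reciprocal uses): precompute r = ceil(2^64 / d), then x / d == (x * r) >> 64, which is exact as long as x is small enough that the rounding error (less than x / 2^64) cannot reach the next integer; it holds for the moderate values tested here.

```cpp
#include <cstdint>

// Illustrative strength reduction only -- RandomX's randomx_reciprocal
// uses its own fixed-point format; this sketch just shows the principle.
// Requires a 128-bit intermediate (__uint128_t on GCC/Clang).

// ceil(2^64 / d) for d > 1, computed without overflowing 64 bits.
uint64_t reciprocal64(uint64_t d) {
    return (~uint64_t(0)) / d + 1;
}

// x / d via a high multiply with the precomputed reciprocal r.
uint64_t divideByConstant(uint64_t x, uint64_t r) {
    return (uint64_t)(((__uint128_t)x * r) >> 64);
}
```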

Potential division by zero in random_reciprocal()

Issue reported in the context of Kudelski Security's audit

Division by zero here if divisor is zero:

RandomX/src/reciprocal.c

Lines 46 to 50 in 1276d67

uint64_t randomx_reciprocal(uint64_t divisor) {
const uint64_t p2exp63 = 1ULL << 63;
uint64_t quotient = p2exp63 / divisor, remainder = p2exp63 % divisor;

I suggest checking for divisor == 0 and returning an appropriate value.
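A minimal sketch of the suggested guard (the sentinel for divisor == 0 is a placeholder, not a spec'd value, and only the first lines of the original routine are reproduced):

```cpp
#include <cstdint>

// Bail out before the division. The sentinel returned for divisor == 0
// is hypothetical; the real fix should pick whatever value the RandomX
// spec deems appropriate.
uint64_t checked_reciprocal_quotient(uint64_t divisor) {
    if (divisor == 0)
        return 0; // hypothetical sentinel, avoids the UB division
    const uint64_t p2exp63 = 1ULL << 63;
    return p2exp63 / divisor; // the rest of the routine would follow
}
```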

IDIV_C and ISDIV_C instructions

What's the point in having them? ISDIV_C compiles to a lot of code:

	; ISDIV_C r4, -692911499
	mov rax, -893288710803585809
	imul r12
	xor eax, eax
	sar rdx, 25
	sets al
	add rdx, rax
	add r12, rdx

And it can clearly be made faster on an ASIC. And since it's essentially multiplication by a constant, maybe replace these two instructions with an explicit multiplication by a 64-bit constant? Then we won't need to handle all the edge cases and the code would become something like this:

	; IMUL_C r4, -893288710803585809
	mov rax, -893288710803585809
	imul r12
	add r12, rdx

P.S. Since the constants are only 32-bit, we can define this instruction as multiplication by a reciprocal without further edge-case handling:

a *= (2^64-1)/C if C != 0
a *= some fixed 64-bit value if C = 0
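In C terms, the P.S. proposal could look like this sketch (the fallback constant for C == 0 is a placeholder, not a proposed value):

```cpp
#include <cstdint>

// Sketch of the proposed IMUL_C semantics: multiply by a 64-bit constant
// derived from the 32-bit immediate as (2^64 - 1) / C, with a fixed
// fallback for C == 0 (the value here is a placeholder).
uint64_t imul_c(uint64_t a, uint32_t c) {
    const uint64_t k = (c != 0) ? (~uint64_t(0)) / c
                                : 0x9AFD1A9FDA9FAD91ULL; // placeholder
    return a * k; // 64-bit wrapping multiply
}
```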
