We would like to increase performance in blake2s by moving the frequent parts into HW.

Accelerate blake2s with hardware support about tillitis-key1 HOT 7 CLOSED

secworks commented on June 11, 2024

Accelerate blake2s with hardware support

from tillitis-key1.

Comments (7)

secworks commented on June 11, 2024 1

Hi!
First, thanks for great comments and ideas!
Changing from byte by byte-copy to w32 copy is a really good idea. It asymptotically reduce the number of read-write ops by four. One needs to measure this of course, but should provide some performance benefit.

I've actually implemented three versions of HW acceleration for blake2s (BLAKE2s, really important according to J-P Aumasson): A full blake2s (https://github.com/secworks/blake2s), a core that perform the G-function, and the minimal version that was related to this issue. The minimal version basically accepts three operands and provides two results in parallel - which is pretty close to what you are suggesting with (a = (a + b) + c and a = (a ^ b) >>> k)

always @*
begin : b2_gf_logic
blake2s_res0 = blake2s_op0_reg + blake2s_op1_reg + blake2s_op2_reg;
blake2s_res1 = ~blake2s_op0_reg ^ blake2s_op1_reg ^ blake2s_op2_reg;
end

The second one (blake2s_res1) is incorrect. It should be a barrel shifter controlled by one of the three operands as k (for example blake2s_op0_reg). The reason for the code was just to see that we could fit even this amount of extra logic resources. Changing to a barrel shifter would eat up even more resources than inverse and xor. Unfortunately, even this very minimal acceleration does not fit in our highly allocated FPGA. I can get the resources, but the routing to get to the resources becomes so long that the clock frequency drops below our target 18 MHz.

The purpose of having two results from the same operands is to have fewer operand registers and muxes to select what operation to perform. There will still be a mux selecting which result to read out. The changes SW part of Blake2s would know which result would be valid and read out the appropriate result.

There is a coprocessor interface on the PicoRV32 (pcpi), which could potentially open up for not having to copy register contents to registers in a core. Unfortunately pcpi doesn't support three operands (since the PicoRV32 is a two operand, one result reg-reg machine). So using pcpi would still require the coprocessor to have hold registers and some control logic. One could potentially see a way to modify the CPU to support three operand operations, but the PicoRV32 doesn't lend itself easily to those more radical modifications. It would for example hit the register file implementation quite hard. The easy route for HW acceleration is MMIO connected cores.

Since I can't even fit these two operations, and we were running out of time for the Bellatrix release, I decided do close this issue and move the minimal blake2s acceleration work to a private repo for now. There are several things we would like to use the very few resources left in the FPGA, and acceleration of blake2s is not the highest priority. But if/when we have a FPGA with more resources I'm sure we will add blake2s HW in some form. One reason is that we would like to not have the Unique Device Secret in Embedded Block Ram, just registers. This would make evil maid-attack much harder. Having the blake2s state in discrete registers, and initialize them with the UDS would allow us to do that.

Regarding the performance - blake2s is used in a several ways not directly related to asymmetric operations. One thing we do every time an application is being loaded onto the TKey device, is to measure it loading it into the application RAM - measure means calculating a blake2s digest of the app together with the Unique Device Secret (UDS) and any User Supplied String (USS). If we could improve the performance of blake2s, we could potentially reduce the time it takes to load the application.

The random number generator app use blake2s in a Hash_DRBG construct to generate the random numbers (periodicallty seeded by the TRNG). And if used in the new signed randomness-mode, it will also calculate a blake2s digest for the random numbers provided to the host. Other applications may also be using the blake2s provided by the FW, so a performant blake2s is quite important.

from tillitis-key1.

secworks commented on June 11, 2024

The following branch contain a HW implementation of the Blake2s G-function. SW writes a, b, c, d, m0 and m1 variables and then read out the updated values for a, b, c, d.

https://github.com/tillitis/tillitis-key1/tree/blake2s_hw

The implementation is a hair too big and we end up with 103% utilization of LCs.

from tillitis-key1.

secworks commented on June 11, 2024

There is an even more minimal acceleration (that is also not 100% correct) for BLAKE2s in the following branch:
main...minimal_blake2s_hw

In this branch we basically accelerate those operations in the G function that takes more than two operators, or can be combined into being so. We calculate the result from two different operations in parallel, and SW is responsible for reading out the result it wants. This saves a control MUX. The implementation fits in the FPGA, but timing is < 18 MHz.

from tillitis-key1.

secworks commented on June 11, 2024

Will not be integrated in the TKey1 device.

from tillitis-key1.

LoupVaillant commented on June 11, 2024

Hi,

If I may there may be lower hanging fruits before we start to add actual hardware support. It’s something I’ve noticed with BLAKE2b on my out-of-order 64-bit laptop CPU, that might affect the small RISC-V core as well: loading bytes in the update function:

void blake2s_update(blake2s_ctx *ctx,
    const void *in, size_t inlen)       // data bytes
{
    for (size_t i = 0; i < inlen; i++) {
        // blake2s_compress() and stuff
        ctx->b[ctx->c++] = ((const uint8_t *) in)[i]; // THIS LINE MAY BE VERY SLOW
    }
}

On big out of order cores I noticed that loading bytes one by one takes up so much time that switching to loading word by word improved my performance by over 50%. Now this significantly complicated my own update function, whose current version looks like the following monster:

void crypto_blake2b_update(crypto_blake2b_ctx *ctx,
                           const u8 *message, size_t message_size)
{
	// Avoid undefined NULL pointer increments with empty messages
	if (message_size == 0) {
		return;
	}

	// Align with word boundaries
	if ((ctx->input_idx & 7) != 0) {
		size_t nb_bytes = MIN(gap(ctx->input_idx, 8), message_size);
		size_t word     = ctx->input_idx >> 3;
		size_t byte     = ctx->input_idx & 7;
		FOR (i, 0, nb_bytes) {
			ctx->input[word] |= (u64)message[i] << ((byte + i) << 3);
		}
		ctx->input_idx += nb_bytes;
		message        += nb_bytes;
		message_size   -= nb_bytes;
	}

	// Align with block boundaries (faster than byte by byte)
	if ((ctx->input_idx & 127) != 0) {
		size_t nb_words = MIN(gap(ctx->input_idx, 128), message_size) >> 3;
		load64_le_buf(ctx->input + (ctx->input_idx >> 3), message, nb_words);
		ctx->input_idx += nb_words << 3;
		message        += nb_words << 3;
		message_size   -= nb_words << 3;
	}

	// Process block by block
	size_t nb_blocks = message_size >> 7;
	FOR (i, 0, nb_blocks) {
		if (ctx->input_idx == 128) {
			blake2b_compress(ctx, 0);
		}
		load64_le_buf(ctx->input, message, 16);
		message += 128;
		ctx->input_idx = 128;
	}
	message_size &= 127;

	if (message_size != 0) {
		// Compress block & flush input buffer as needed
		if (ctx->input_idx == 128) {
			blake2b_compress(ctx, 0);
			ctx->input_idx = 0;
		}
		if (ctx->input_idx == 0) {
			ZERO(ctx->input, 16);
		}
		// Fill remaining words (faster than byte by byte)
		size_t nb_words = message_size >> 3;
		load64_le_buf(ctx->input, message, nb_words);
		ctx->input_idx += nb_words << 3;
		message        += nb_words << 3;
		message_size   -= nb_words << 3;

		// Fill remaining bytes
		FOR (i, 0, message_size) {
			size_t word = ctx->input_idx >> 3;
			size_t byte = ctx->input_idx & 7;
			ctx->input[word] |= (u64)message[i] << (byte << 3);
			ctx->input_idx++;
		}
	}
}

Now in the PicoRV32 core I don’t expect the speed up to be nearly as impressive, but I expect it costs little firmware ROM, and no FPGA gates at all. It might be worth trying out someday.

Now I understand actual hardware support may not be worth adding, but I have a couple ideas. How about, instead of implementing the full G function, we’d only implement tiny bits of it in a couple dedicated instructions? Recalling the G function’s main shape:

a += x;  a += b;  d ^= a;  d >>>= 16;
         c += d;  b ^= c;  b >>>= 12;
a += y;  a += b;  d ^= a;  d >>>=  8;
         c += d;  b ^= c;  b >>>=  7;

I see a couple candidates for macro-op fusion or dedicated opcodes (a, b, and c are registers, k is a constant).

a = (a + b) + c
a = (a ^ b) >>> k
a += b;  c ^= a;  c >>>= k

For the ones involving k we could have 4 dedicated opcodes instead of using the barrel shifter. For the last one I even suspect that’s the only way to make it work without lowering frequency.

Though if we’re being honest, the main bottleneck right now probably lies in asymmetric operations. If special hardware support were ever to be added the priority is probably bignum arithmetic, to speed up elliptic curves (and polynomial hashes). Of course there’s no way we’d ever add big multipliers on the puny ICE, I was thinking more of 32->64 bits MUL-ADD instructions, likely separated into MULADD and MULADDH[S] or like MUL and MULH.

Though now the only way to take advantage is for application writers to use those likely non-standard instructions deliberately, either by patching code generation, adding intrinsics, or writing assembly by hand. Probably best left to users of unlocked version.

from tillitis-key1.

LoupVaillant commented on June 11, 2024

Thanks for your detailed answer. I figured you had good reasons, but man that is due diligence.

If we could improve the performance of blake2s, we could potentially reduce the time it takes to load the application.

I don't know the relative performance of download and hashing, but I was thinking that perhaps the current hardware (including bitstream) allows parallel download and computation? If it does, interleaving hashing with download might speed things up a bit even if the actual performance of BLAKE2s isn't improved.

(Just one little regret here: programs come in 127 byte chunks, while they theoretically could have come in 128 byte chunks if we omitted the header just this once. It would have made optimising incremental hashing a teeny bit easier, since 128 bytes mean exactly two BLAKE2s blocks. Though it would also have broken your firmware protocol's orthogonality…)

from tillitis-key1.

secworks commented on June 11, 2024

Good point on the protocol, for some reason we wanted to have a total size of 128 bytes (including at least a one byte header). But we have also discussed having support for complete 128 blocks. And as you say, this would be a nice match for the hash block size.

from tillitis-key1.

Accelerate blake2s with hardware support about tillitis-key1 HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent