riscv / riscv-bitmanip

Working draft of the proposed RISC-V Bitmanipulation extension
Home Page: https://jira.riscv.org/browse/RVG-122
License: Creative Commons Attribution 4.0 International
This is a very general name. Other extensions will also want intrinsics files. Perhaps B extension intrinsics should be in a file with a name like rvb-intrin.h to make it clear that these are RISC-V B extension intrinsics.
If we are putting all intrinsics in one file, then we may need to conditionalize them based on whether that particular extension is enabled. That requires a macro as per issue #28.
I request that instruction CLMUL, standing for "carry-less multiply", be renamed to XMUL, for "XOR multiply", meaning a multiplication where the partial products are summed by bitwise XORs instead of the usual additions. My reason is simply that I find the name "carry-less multiply" to be awkward, and I'm probably not alone. The name "carry-less multiply" appears not to be entrenched except in connection to x86 processors.
I attempted a search, and as far as I can tell, the "CLMUL" name has been adopted as part of a standard ISA only for the x86 (with SIMD instruction PCLMULQDQ). The B extension draft notes that the equivalent SPARC instruction is XMULX, officially documented as "XOR multiply". Most Web references to "carry-less multiply", "carry-less multiplication", and "carry-less product" seem to point back one way or another to Intel's CLMUL instruction. Also, there appears as yet to be no __builtin_clmul in GCC. The path should therefore be clear for us to choose the name XMUL if others agree with me it would be preferable.
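For reference, the operation itself (under either name) sums the partial products with XOR instead of addition; a minimal C sketch of the low half of the product, not taken from the spec:

```c
#include <stdint.h>

// Carry-less ("XOR") multiply, low 64 bits of the product (XLEN = 64 sketch).
// Each set bit of b contributes a shifted copy of a, combined with XOR
// instead of ADD, so no carries propagate between bit positions.
uint64_t clmul(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;   // partial product, XOR-accumulated
    return r;
}
```

For example, clmul(3, 3) is 5, since (x+1)^2 = x^2 + 1 over GF(2).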
In the rvb_simple EU:
When executing sh1addu.w, shadd_active == 1 and wuw_active == 1 at the same time, so rd = shadd_out | wuw_dout. wuw_dout is set because din_insn14 == 0. For sh2addu.w and sh3addu.w, din_insn14 == 1, so those instructions work correctly.
@cliffordwolf
Hi Clifford
In the spec the pseudo code for SROW is as follows
int shamt = rs2 & (XLEN - 1);
return ~(~rs1 >> shamt);
In the case of SROW, is XLEN the length of the machine (64-bit) or the target of the operation (32-bit)? In other words, if
rs2 = 0xFFFF_FFFF_FFFF_FFFF
is the shift amount 31 or 63?
The base ISA defines ADDW and SUBW like so:
ADDW and SUBW are RV64I-only instructions that are defined analogously to ADD and SUB but operate on 32-bit values and produce signed 32-bit results. Overflows are ignored, and the low 32-bits of the result is sign-extended to 64-bits and written to the destination register.
The Bitmanip extension refers to clzw, ctzw, pcntw, etc. but doesn't actually define how they work. The pseudocode is only defined for the instructions that operate on data of size XLEN.
Sophisticated readers understand that the *w instructions in Bitmanip are almost certainly supposed to behave like the *W instructions in the base architecture (operating on 32-bit data and then sign-extending the result to XLEN). However, this should be explicitly addressed with some verbiage and pseudocode so that everything is fully specified and unambiguous.
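For concreteness, a hedged sketch of what clzw presumably means on RV64 (operate on the low 32 bits, then sign-extend the 32-bit count to XLEN); this is an inference from the base *W convention, not normative:

```c
#include <stdint.h>

// Presumed CLZW semantics on RV64: count leading zeros of the low 32 bits
// of rs1, then sign-extend the 32-bit result to 64 bits (trivial here,
// since the count is in 0..32).
int64_t clzw(uint64_t rs1) {
    uint32_t x = (uint32_t)rs1;   // operate on 32-bit data only
    int32_t n = 0;
    for (int i = 31; i >= 0; i--, n++)
        if ((x >> i) & 1)
            break;                // stop at the most significant set bit
    return (int64_t)n;            // sign-extend to XLEN
}
```

Note that clzw(0) is 32 under this reading, and the upper 32 bits of rs1 are ignored entirely.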
Hi all,
In the proposed patch for the bitmanip Zbs family of instructions found here, the format specifier for sbclri doesn't work correctly, because we're expecting a const_int with one bit low.
So, in order for `ctz_hwi' to work correctly, we need to invert the operand first, and for that we need a separate format specifier (or amend the const_int opcode in place in the .md file; I chose the former.)
The patch:
gcc-b-support-fix-sbclri.patch.txt
Testcase:
unsigned int
f(unsigned int a)
{
return a & ~(1 << 29);
}
Before:
f:
sbclri a0,a0,0
ret
After:
f:
sbclri a0,a0,29
ret
Should riscv-bitmanip/texsrc/bext.tex (line 1176 in 24df418) in fact read "For any values of A, B and C"?
Continuation of off-topic discussion in #10.
Quick summary:
[T]here's a whole category of instructions that would have more impact but aren't currently included, and that is the unsigned equivalents of the existing RV64I *W instructions: ADDWU, SUBWU, SLLWU, etc. These would be just like the existing *W instructions but instead zeroing the upper 32 bits, as appropriate for an unsigned int or uint32_t result type rather than int or int32_t. We hardly need to run any experiments to know that such *WU instructions would be used far more frequently than the *W instructions proposed for the B extension.
The idea behind *W instructions is simply that they operate on the lower 32 bits. It makes sense to have a consistent scheme for how to fill the upper bits, but it doesn't matter much what this scheme is exactly, if that scheme is sign-extend or zero-extend.
If you provide *W operations you have to decide what to do with the upper bits. You can leave them alone (x86), zero-extend them (AArch64), or sign-extend them (RISC-V). If you leave them alone then casting a 32-bit value to a 64-bit value requires a sext or zext every time. If you sign-extend them then only unsigned values require a zext; signed ones are already correct. If you zero-extend them then only signed values require a sext; unsigned ones are already correct. It's hard to say which is better. Most normal application code uses signed more than unsigned, favouring sign extension.
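A hedged C sketch of the contrast (XLEN = 64; ADDWU assumed to be defined analogously to ADDW but zero-extending, since no such instruction is ratified):

```c
#include <stdint.h>

// Existing ADDW: add low 32 bits, sign-extend the 32-bit result to 64 bits.
uint64_t addw(uint64_t a, uint64_t b) {
    return (uint64_t)(int64_t)(int32_t)((uint32_t)a + (uint32_t)b);
}

// Proposed ADDWU: same addition, but zero-extend the 32-bit result instead.
uint64_t addwu(uint64_t a, uint64_t b) {
    return (uint64_t)(uint32_t)((uint32_t)a + (uint32_t)b);
}
```

The two differ only in how the upper 32 bits of the 64-bit result are filled, which is exactly the choice discussed above.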
Andrew has told me he wants to keep open the option for *WU instructions for now, which implies to me we should devise a system to reserve the encoding space now, even if the idea eventually gets dropped. I haven't run this particular system by him yet, but I intend to do so soon.
Naturally, we would want any system that's adopted to be fully compatible with the B extension, by tweaking either or both as necessary. I'll be looking into this question soon. And now you or anyone else can do so too, if you're so inclined.
But this is getting off-topic for this GitHub issue, so we should move the discussion elsewhere if you'd like to continue.
I agree that there is merit in systematically adding *WU versions of R-type instructions that have *W versions. If you do it at all then it should be for BOTH the base instruction set and for any new *W instructions in the BitManip extension.
This is easily affordable in terms of opcode space by, as has been pointed out, using something in the hi bits of the instruction, keeping the identical opcode and func3 to the *W version. It's also very cheap to implement.
Probably the only OP-IMM-32 instruction that can be justified is ADDIWU. That does need a new opcode.
Interestingly, in a message to me yesterday where we talked about possible *WU instructions, Andrew literally wrote:
I’m not opposed to putting them in B.
To be sure, that expresses ambivalence more than advocacy. But as far as "could be persuaded" goes, I believe yes.
I hope everyone feels like this summary treats them fairly. Please post your corrections below.
Hi all,
I have put together a small patchset which implements command-line bitmanip ISA subset selection. I've tried to follow the current bitmanip draft spec as closely as possible. In particular, the 'B' bitmanip subset is taken to be the one indicated by the extended dotted line (everything excluding Zbt and Zbf).
Please also read the "Problems" section at the bottom.
The bitmanip spec currently defines 9 subgroups of instructions. I defined them as target flags residing in a new target variable called "x_riscv_bitmanip_flags", each called accordingly:
OPTION_MASK_BITMANIP_ZBB
OPTION_MASK_BITMANIP_ZBC
OPTION_MASK_BITMANIP_ZBE
OPTION_MASK_BITMANIP_ZBF
OPTION_MASK_BITMANIP_ZBM
OPTION_MASK_BITMANIP_ZBP
OPTION_MASK_BITMANIP_ZBR
OPTION_MASK_BITMANIP_ZBS
OPTION_MASK_BITMANIP_ZBT
When invoking gcc as follows:
$ riscv32-unknown-elf-gcc -O2 -march=rv32ib -mabi=ilp32 -S -o gcc-demo.s gcc-demo.c
the flag states become
ZBB, ZBC, ZBE, ZBF, ZBM, ZBP, ZBR, ZBS, ZBT, MASK_BITMANIP
1 1 1 0 1 1 1 1 0 1
If the user provides at least one sub-ISA specifier, then only the sub-ISA flags are honoured, e.g.:
$ riscv32-unknown-elf-gcc -O2 -march=rv32ib_Zbb -mabi=ilp32 -S -o gcc-demo.s gcc-demo.c
will set
ZBB, ZBC, ZBE, ZBF, ZBM, ZBP, ZBR, ZBS, ZBT, MASK_BITMANIP
1 0 0 0 0 0 0 0 0 1
and all following sub-ISA specifiers are simply additive, e.g.:
$ riscv32-unknown-elf-gcc -O2 -march=rv32ib_Zbb_Zbf_Zbt -mabi=ilp32 -S -o gcc-demo.s gcc-demo.c
will set
ZBB, ZBC, ZBE, ZBF, ZBM, ZBP, ZBR, ZBS, ZBT, MASK_BITMANIP
1 0 0 1 0 0 0 0 1 1
and so forth. The user must provide the 'b' keyword before adding any "ZbX" directives, and the "ZbX" directives must always follow directly after the 'b' directive. The "ZbX" directives must always have at least one set of underscores surrounding them. If there are multiple "ZbX" directives, they must come one after the other.
The "MASK_BITMANIP" target macro is still there, but it is not used in the bitmanip.md conditions.
There are 3 patches. Each can be applied without breaking the build, but they must come in the right order.
Patch 1: add riscv.opt "riscv_bitmanip_flags" variable, and associated masks.
Patch 2: modify bitmanip.md insns to only get generated if the associated subset mask is set.
Patch 3: modify riscv-common.c to accept "ZbX" form of subset ISA flags.
I've also provided a patch that applies all of them at once.
There is almost certainly some fiddling to be done with the riscv.c file as well, and I suspect there are a few corner cases in the parser, but this is useful as it is, and I would like to see what people think of the current implementation.
Output assembly arch attributes look like this:
.attribute arch, "rv32i2p0_b2p0_Zbb2p0_Zbc2p0_Zbt2p0_Zbp2p0"
Patches:
0001-add-bmi-subisa-march-opts.patch.txt
0002-add-bmi-subisa-march-bitmanip.patch.txt
0003-add-bmi-subisa-march-common.patch.txt
add-bmi-subisa-march-all.patch.txt
Problems:
No order of flags is currently enforced. What is the canonical order of the ZbX flags? Alphabetical? Or something else?
Unfortunately, GCC refuses to create target masks relative to a specified variable if a flag name is not provided. I am referring to the riscv.opt file.
Take for example the ZBB directive:
...
mbmi-zbb
Target Mask(BITMANIP_ZBB) Var(riscv_bitmanip_flags)
Support the base subset of the Bitmanip extension.
...
This causes the following code to be generated in build/gcc/options.h:
#define OPTION_MASK_BITMANIP_ZBB (HOST_WIDE_INT_1U << 0) // <<<<<<<<<<<<<<<<
#define OPTION_MASK_BITMANIP_ZBC (HOST_WIDE_INT_1U << 1)
#define OPTION_MASK_BITMANIP_ZBE (HOST_WIDE_INT_1U << 2)
#define OPTION_MASK_BITMANIP_ZBF (HOST_WIDE_INT_1U << 3)
#define OPTION_MASK_BITMANIP_ZBM (HOST_WIDE_INT_1U << 4)
#define OPTION_MASK_BITMANIP_ZBP (HOST_WIDE_INT_1U << 5)
#define OPTION_MASK_BITMANIP_ZBR (HOST_WIDE_INT_1U << 6)
#define OPTION_MASK_BITMANIP_ZBS (HOST_WIDE_INT_1U << 7)
#define OPTION_MASK_BITMANIP_ZBT (HOST_WIDE_INT_1U << 8)
#define MASK_DIV (1U << 0)
#define MASK_EXPLICIT_RELOCS (1U << 1)
#define MASK_FDIV (1U << 2)
#define MASK_SAVE_RESTORE (1U << 3)
#define MASK_STRICT_ALIGN (1U << 4)
#define MASK_64BIT (1U << 5)
#define MASK_ATOMIC (1U << 6)
#define MASK_BITMANIP (1U << 7)
#define MASK_DOUBLE_FLOAT (1U << 8)
#define MASK_HARD_FLOAT (1U << 9)
#define MASK_MUL (1U << 10)
#define MASK_RVC (1U << 11)
#define MASK_RVE (1U << 12)
We probably don't want to expose this dual method of specifying sub-ISAs, so let's try removing the -mbmi-zbb name from riscv.opt:
...
Target Mask(BITMANIP_ZBB) Var(riscv_bitmanip_flags)
...
This causes the following output in build/gcc/options.h:
#define OPTION_MASK_BITMANIP_ZBC (HOST_WIDE_INT_1U << 0)
#define OPTION_MASK_BITMANIP_ZBE (HOST_WIDE_INT_1U << 1)
#define OPTION_MASK_BITMANIP_ZBF (HOST_WIDE_INT_1U << 2)
#define OPTION_MASK_BITMANIP_ZBM (HOST_WIDE_INT_1U << 3)
#define OPTION_MASK_BITMANIP_ZBP (HOST_WIDE_INT_1U << 4)
#define OPTION_MASK_BITMANIP_ZBR (HOST_WIDE_INT_1U << 5)
#define OPTION_MASK_BITMANIP_ZBS (HOST_WIDE_INT_1U << 6)
#define OPTION_MASK_BITMANIP_ZBT (HOST_WIDE_INT_1U << 7)
#define MASK_DIV (1U << 0)
#define MASK_EXPLICIT_RELOCS (1U << 1)
#define MASK_FDIV (1U << 2)
#define MASK_SAVE_RESTORE (1U << 3)
#define MASK_STRICT_ALIGN (1U << 4)
#define MASK_64BIT (1U << 5)
#define MASK_ATOMIC (1U << 6)
#define MASK_BITMANIP (1U << 7)
#define MASK_DOUBLE_FLOAT (1U << 8)
#define MASK_HARD_FLOAT (1U << 9)
#define MASK_MUL (1U << 10)
#define MASK_RVC (1U << 11)
#define MASK_RVE (1U << 12)
#define MASK_BITMANIP_ZBB (1U << 13) // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
It's now relative to the general-purpose target_flags variable, rather than riscv_bitmanip_flags. Is there some magic combination of GCC option properties to get around this?
I'm trying to compile examples with the compiled toolchain from the riscv repository. However, I could not find the said file in any repository here. Where does it come from?
The document says:
slliu.w is identical to slli, except that bits XLEN-1:32 of the rs1 argument are cleared before the shift.
However, the proposed encoding for SLLIU.W shows another difference: unlike the RV64 SLLI, the shift immediate for SLLIU.W is only 5 bits, supporting a maximum shift of 31 bits. If this was intentional, the limitation should be explained. Otherwise, it looks like the encoding will need to be changed to accommodate a 6-bit shift for RV64 and, I presume, a 7-bit shift eventually for RV128.
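For reference, a hedged C sketch of the described SLLIU.W behavior with the 5-bit immediate as currently encoded (RV64):

```c
#include <stdint.h>

// SLLIU.W as described: clear bits 63:32 of rs1, then shift left by the
// immediate. The `& 31` models the 5-bit immediate field the issue points
// out, i.e. a maximum shift of 31.
uint64_t slliu_w(uint64_t rs1, unsigned imm5) {
    return ((uint64_t)(uint32_t)rs1) << (imm5 & 31);
}
```

With a 6-bit immediate the mask would be `& 63` instead, matching the RV64 SLLI encoding.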
@cliffordwolf
Hi Clifford
I note that c.brev and c.not have been removed from the spec, but they still appear in cproofs/insns.h.
Is this an oversight, or is it intentional since they are pseudo-ops?
Thx
Lee
@cliffordwolf
Hi Clifford, could you please take a look at the pull request I made a week ago to review whether it can be merged or I need to make some changes ?
Many Thx
Lee
Is it the intention that orc16.w is included in the Zbb subset, or is it just orc16?
Thanks.
I just committed a patch upstream to optimize a zero-extend followed by an array indexing left shift. This was sometimes three instructions, now it is two. This added a pattern zero_extendsidi2_shifted that is identical to the bitmanip branch slliuw pattern, except that it splits into two shifts instead of emitting an slliuw instruction. In order for the slliuw pattern to continue working, this upstream pattern will need a ! TARGET_BITMANIP check added to its condition. This issue is just to document the problem for when we eventually rebase later.
riscvarchive/riscv-gcc#187
no execution testing on simulators, only compilation testing
There are some INSTW or INST.W instructions, such as:
riscv-bitmanip/texsrc/bext.tex
Line 224 in a05231d
riscv-bitmanip/texsrc/bext.tex
Line 1686 in a05231d
W instructions are meant to keep 32-bit computations on an RV64 machine; is this property valid?
In the current bitmanip specs, these two properties seem not to hold; for example, shnaddu.w, adduw, and subuw return the 64-bit result, not the sign-extended lower 32 bits.
@cliffordwolf
Hi Clifford,
I have a query regarding the documented behavior for fsl, the text says that
The fsl rd, rs1, rs2, rs3 instruction creates a 2 x XLEN word by concatenating rs1 and rs3
(with rs1 in the MSB half)
From the pseudocode, it looks as though rs1 is in the lower half.
Can you clarify which is correct?
Thx
Lee
This might be useful:
VLU(...) is a little-endian variable-length integer coding that prefixes data bits with unary code length bits. The length is recovered by counting the least significant set bits, which encode a count of n-bit basic units. The data bits compactly trail the unary code prefix.
clz
ctz
With an 8 bit basic unit, the encoded size is similar to LEB128; 7-bits can be stored in 1 byte, 56-bits in 8 bytes and 112-bits in 16 bytes. Decoding, however, is significantly faster than LEB128, as it is not necessary to check for continuation bits every byte, instead the length can be decoded in a single count bits operation.
While VLU is not in major use, it could be substituted where LEB128 is used with reasonably significant benefits depending on the frequency of variable length fields. LEB128 probably performs similarly to VLU on a machine without bit scan forward and reverse. There are also potential SIMD or vector optimisations. For example, a decoder could have a predictor, and switch from a "per field" mode to a set of optimized modes. e.g. 128-bit SIMD code for parallel decoding of 16 x 7-bit fields.
There is symmetry in the encode and decode, with clz
for figuring out the size of a word, and ctz
to read the /prefix/ from the little-end. The code is a pretty good example of why little-endian makes more sense. The benchmarks currently perform decoding of 8-bit through to 56-bit and there is an optimized decoder for x86-64 BMI. I am investigating x86 SIMD and want to add support for big numbers. 112-bits and >= 128-bits.
The long-established name of the logic operation that ANDs two Boolean inputs and complements the result is NAND, not CAND. RISC-V assembly language has a pseudo-instruction for a bitwise complement called NOT, not COMPL or whatever. This draft extension includes instructions called NAND, NOR, and C.NOT. For consistency, shouldn't the instruction that computes a bitwise AND with the complement of the second operand be called ANDN instead of ANDC?
The gcc patch should define a preprocessor macro so end users can check to see if bitmanip support is enabled for the target. I would suggest __riscv_bitmanip since the rest of the code seems to be using bitmanip consistently.
See also the riscv/riscv-c-api-doc repo where we are documenting C API issues like preprocessor macros. I'm filing a pull request there to suggest __riscv_bitmanip which can be changed if someone has a better suggestion.
Hello,
The compilation of source foo.c (see below) fails with internal error when the compiler is run as follows:
$ riscv64-unknown-elf-gcc -O2 -march=rv64ib -mabi=lp64 -S -o foo.S foo.c
during RTL pass: combine
foo.c: In function ‘foo’:
foo.c:4:1: internal compiler error: in decompose, at rtl.h:2279
Source code foo.c:
int foo(int n)
{
return n + 0x7fffffff;
}
Additional notes:
gcc -v log:
[user@s01 bug]$ ../riscv64b/bin/riscv64-unknown-elf-gcc -Os -march=rv64ib -mabi=lp64 -v -S -o foo.S foo.c
Using built-in specs.
COLLECT_GCC=../riscv64b/bin/riscv64-unknown-elf-gcc
Target: riscv64-unknown-elf
Configured with: ../riscv-gcc/configure --prefix=/home/user/riscv-bitmanip/riscv64b --target=riscv64-unknown-elf --enable-languages=c --disable-libssp
Thread model: single
Supported LTO compression algorithms: zlib
gcc version 10.0.0 20190929 (experimental) (GCC)
COLLECT_GCC_OPTIONS='-Os' '-march=rv64ib' '-mabi=lp64' '-v' '-S' '-o' 'foo.S'
/home/user/riscv-bitmanip/riscv64b/libexec/gcc/riscv64-unknown-elf/10.0.0/cc1 -quiet -v foo.c -quiet -dumpbase foo.c -march=rv64ib -mabi=lp64 -auxbase-strip foo.S -Os -version -o foo.S
GNU C17 (GCC) version 10.0.0 20190929 (experimental) (riscv64-unknown-elf)
compiled by GNU C version 4.8.5 20150623 (Red Hat 4.8.5-39), GMP version 6.0.0, MPFR version 3.1.1, MPC version 1.0.1, isl version none
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
ignoring nonexistent directory "/home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/../../../../riscv64-unknown-elf/sys-include"
#include "..." search starts here:
#include <...> search starts here:
/home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/include
/home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/include-fixed
/home/user/riscv-bitmanip/riscv64b/lib/gcc/riscv64-unknown-elf/10.0.0/../../../../riscv64-unknown-elf/include
End of search list.
GNU C17 (GCC) version 10.0.0 20190929 (experimental) (riscv64-unknown-elf)
compiled by GNU C version 4.8.5 20150623 (Red Hat 4.8.5-39), GMP version 6.0.0, MPFR version 3.1.1, MPC version 1.0.1, isl version none
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
Compiler executable checksum: bff2308ac495c30be0d25ad6caff4627
during RTL pass: combine
foo.c: In function ‘foo’:
foo.c:4:1: internal compiler error: in decompose, at rtl.h:2279
4 | }
| ^
0x556a3e wi::int_traits<std::pair<rtx_def*, machine_mode> >::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode> const&)
../../riscv-gcc/gcc/rtl.h:2277
0xbd4961 wi::int_traits<std::pair<rtx_def*, machine_mode> >::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode> const&)
../../riscv-gcc/gcc/wide-int.h:3102
0xbd4961 wide_int_ref_storage<std::pair<rtx_def*, machine_mode> >
../../riscv-gcc/gcc/wide-int.h:1032
0xbd4961 generic_wide_int<std::pair<rtx_def*, machine_mode> >
../../riscv-gcc/gcc/wide-int.h:790
0xbd4961 add<std::pair<rtx_def*, machine_mode>, std::pair<rtx_def*, machine_mode> >
../../riscv-gcc/gcc/wide-int.h:2422
0xbd4961 simplify_const_binary_operation(rtx_code, machine_mode, rtx_def*, rtx_def*)
../../riscv-gcc/gcc/simplify-rtx.c:4318
0xbd9cde simplify_binary_operation(rtx_code, machine_mode, rtx_def*, rtx_def*)
../../riscv-gcc/gcc/simplify-rtx.c:2156
0x1227a71 combine_simplify_rtx
../../riscv-gcc/gcc/combine.c:5804
0x122a492 subst
../../riscv-gcc/gcc/combine.c:5726
0x122a108 subst
../../riscv-gcc/gcc/combine.c:5667
0x122c4ed try_combine
../../riscv-gcc/gcc/combine.c:3422
0x12323d8 combine_instructions
../../riscv-gcc/gcc/combine.c:1305
0x12323d8 rest_of_handle_combine
../../riscv-gcc/gcc/combine.c:15066
0x12323d8 execute
../../riscv-gcc/gcc/combine.c:15111
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
System info: CentOS 7, uname -a: Linux XXXXXXXXX 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
It starts off by listing both grev ... rs2 and grevi ... imm.
Then the first paragraph says "It takes in a single register value and an immediate ..." which conflicts with the above and suggests the last operand must be an immediate.
Then the second paragraph says "This operation iteratively checks each bit i in rs2 ..." which suggests that the last operand must be a register.
Is there a recipe available to compile g++ with support for bitmanip extension? I tried adding c++ to the --enable-languages parameter but that did not help.
I need a small clarification. The description for the cmix instruction says that it is equivalent to the following sequence:
and rd, rs1, rs2
andn t0, rs3, rs2
or rd, rd, t0
Is it implied that the register t0 will be modified as a result of the execution of this instruction?
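From the quoted sequence, the net effect of cmix (setting aside the scratch-register question) can be sketched in C as:

```c
#include <stdint.h>

// CMIX as implied by the and/andn/or sequence above: rs2 acts as a bit
// mask, selecting bits from rs1 where rs2 is 1 and from rs3 where rs2 is 0.
uint64_t cmix(uint64_t rs1, uint64_t rs2, uint64_t rs3) {
    return (rs1 & rs2) | (rs3 & ~rs2);
}
```

As a single instruction one would expect only rd to be written, with t0 in the expansion serving purely to express the semantics, but the text should say so explicitly.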
The new "prefix zero-extend" instructions currently have names that don't follow the existing convention for *W instructions. In particular, these new instructions do not act at all like instructions DIVUW and REMUW, which perform their operation on two unsigned 32-bit values and then sign-extend the 32-bit result.
To avoid confusion, the new instructions need different names. I propose
ADDZX
SUBZX
SLLZXI
where "ZX" stands for "zero-extend". For example, instruction
SLLZXI rd,rs1,i
acts the same as the sequence
ZEXT.W rd,rs1
SLLI rd,rd,i
Reflecting on https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/0emw3Y8ZNxY I'd like to propose two new instructions for packing structs of bitfields and bytes.
Note that the RISC-V calling convention requires structs that fit in a register to be passed in a register when passed by value. This means that in the worst case we need to pack that register on each function call.
First, a REPACK[I] instruction (in Zbf) with the following semantic would help improve the performance of bitfield packing.
uint_xlen_t repack(uint_xlen_t rs1, uint_xlen_t rs2)
{
	int shamt = rs2 & (XLEN-1);
	uint_xlen_t lower = (rs1 << XLEN/2) >> XLEN/2;
	uint_xlen_t upper = (rs1 >> XLEN/2) << shamt;
	uint_xlen_t mask = ~(uint_xlen_t)0 << shamt;
	return (upper & mask) | (lower & ~mask);
}
That is, take the upper half of rs1, and place it over the lower half of rs1 at the offset specified by rs2. This could likely re-use most of the circuitry for BFP, the other instruction in Zbf.
Packing N data registers D0,D1,..,D(N-1) into a bit field, using the lengths L0,L1,..,L(N-1):
PACK a0,D0,D1
REPACKI a0,a0,L0
PACK a0,a0,D2
REPACKI a0,a0,L1+L2
...
PACK a0,a0,D(N-1)
REPACKI a0,a0,L1+L2+...+L(N-2)
A word with N bitfields can be packed in 2*(N-1) instructions this way, when L1+L2+...+L(N-2) < XLEN/2 and L(i) <= XLEN/2 for all i in 0..(N-1). Only a few extra instructions are needed to stitch together the larger pieces in the remaining cases.
The main difference in use-case between REPACK and BFP is that the former is primarily useful for constructing a new struct of bitfields from its members, whereas the latter is primarily useful when overwriting one particular bitfield in such an existing struct, usually as part of a read-modify-write pattern.
A Pack Bytes (PACKB) instruction in Zbb would help to pack structs of bytes.
uint_xlen_t packb(uint_xlen_t rs1, uint_xlen_t rs2)
{
return (rs1&255) | ((rs2&255)<<8);
}
This would allow packing of 4 bytes into a 32-bit word in 3 instructions instead of 5, and would only require "Zbb" (that is, it would not require SHFL, unlike the 5-instruction solution):
PACKB a0, a0, a1
PACKB a1, a2, a3
PACK[W] a0, a0, a1
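A hedged C sketch of that 3-instruction sequence (RV32; PACK is assumed here to concatenate the low halves of rs1 and rs2, with rs1 in the lower half):

```c
#include <stdint.h>

// PACKB as proposed above: low byte of rs1, with the low byte of rs2 above it.
uint32_t packb(uint32_t rs1, uint32_t rs2) {
    return (rs1 & 255) | ((rs2 & 255) << 8);
}

// Assumed PACK semantics (RV32): low halfword of rs1, low halfword of rs2 above it.
uint32_t pack(uint32_t rs1, uint32_t rs2) {
    return (rs1 & 0xffff) | ((rs2 & 0xffff) << 16);
}

// Pack 4 bytes into a 32-bit word: PACKB; PACKB; PACK — three instructions.
uint32_t pack4(uint32_t b0, uint32_t b1, uint32_t b2, uint32_t b3) {
    return pack(packb(b0, b1), packb(b2, b3));
}
```

For example, pack4(0x11, 0x22, 0x33, 0x44) yields 0x44332211, the little-endian byte order one would expect from storing the struct to memory.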
Moving this discussion here from
https://groups.google.com/a/groups.riscv.org/forum/?utm_medium=email&utm_source=footer#!msg/isa-dev/0emw3Y8ZNxY/eUT5_IzaAwAJ.
The proposal is to add dedicated sext.h and sext.b instructions.
uint_xlen_t sext_h(uint_xlen_t rs)
{
int shamt = XLEN - 16;
return sra(sll(rs, shamt), shamt);
}
uint_xlen_t sext_b(uint_xlen_t rs)
{
int shamt = XLEN - 8;
return sra(sll(rs, shamt), shamt);
}
The encoding cost would be minimal because these are unary instructions.
The hardware cost would be acceptable, if the instruction is being used.
The main argument for this instruction is that the RISC-V calling convention requires arguments < 32 bit to be sign/zero extended according to their type. For example:
extern "C" int foo(short);
int bar(int a, int b) {
return foo(a+b);
}
This is compiled to the following without B extensions:
bar(int, int):
addw a0,a0,a1
slliw a0,a0,16
sraiw a0,a0,16
tail foo
And could be compiled to the following with sext.h:
bar(int, int):
addw a0,a0,a1
sext.h a0,a0
tail foo
The expectation is that function arguments < 32 bit may be common in code that is ported to RISC-V from smaller 8-bit or 16-bit micro controllers.
With those instructions added we would be able to zero-extend or sign-extend any 8-, 16-, or 32-bit value in a single instruction:
Width | sign ext | zero ext
---|---|---
8 | sext.b rd,rs | andi rd,rs,255
16 | sext.h rd,rs | pack[w] rd,rs,zero
32 | addw rd,rs,zero | pack rd,rs,zero
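As a sanity check, the shift-pair definitions above can be written directly in C (a sketch assuming XLEN = 64 and an arithmetic right shift on signed types, which all mainstream compilers provide):

```c
#include <stdint.h>

// sext.h: shift the halfword to the top, then arithmetic-shift it back down.
int64_t sext_h(uint64_t rs) {
    return (int64_t)(rs << 48) >> 48;
}

// sext.b: same pattern with an 8-bit field.
int64_t sext_b(uint64_t rs) {
    return (int64_t)(rs << 56) >> 56;
}
```

These are exactly the two-instruction sll/sra sequences the proposal would collapse into one instruction each.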
Hello,
Currently bswapdi2 is defined in bitmanip.md as follows:
(define_insn "bswapdi2"
[(set (match_operand:SI 0 "register_operand" "=r")
(bswap:SI (match_operand:SI 1 "register_operand" "r")))]
"TARGET_64BIT && TARGET_BITMANIP"
"grevi\t%0,%1,0x38"
[(set_attr "type" "bitmanip")])
Looks like SI should be replaced with DI in this definition.
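Presumably the corrected pattern (with SI replaced by DI, as suggested above) would read:

```lisp
(define_insn "bswapdi2"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (bswap:DI (match_operand:DI 1 "register_operand" "r")))]
  "TARGET_64BIT && TARGET_BITMANIP"
  "grevi\t%0,%1,0x38"
  [(set_attr "type" "bitmanip")])
```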
My test was a function listed below (originally defined in libgcc/libgcc2.c):
typedef long DItype;
DItype
__bswapdi2 (DItype u)
{
return ((((u) & 0xff00000000000000ull) >> 56)
| (((u) & 0x00ff000000000000ull) >> 40)
| (((u) & 0x0000ff0000000000ull) >> 24)
| (((u) & 0x000000ff00000000ull) >> 8)
| (((u) & 0x00000000ff000000ull) << 8)
| (((u) & 0x0000000000ff0000ull) << 24)
| (((u) & 0x000000000000ff00ull) << 40)
| (((u) & 0x00000000000000ffull) << 56));
}
Before replacement of SI with DI the function assembly was broken:
__bswapdi2:
addi sp,sp,-16
sd ra,8(sp)
call __bswapdi2 /// <-- infinite recursion
ld ra,8(sp)
addi sp,sp,16
jr ra
After replacement it looks good:
__bswapdi2:
grevi a0,a0,0x38
ret
Please make a fix.
The pseudo-instructions defined for GREVI (BREV.P, PSWAP.N, etc.) are like the ZIP and UNZIP pseudo-instructions in that they move "units" of a power-of-two size within "components" of a larger power-of-two size. For the ZIP/UNZIP pseudo-instructions, there is a simple pattern of
ZIP<unit-size><component-suffix>
UNZIP<unit-size><component-suffix>
In this system, <unit-size> is either empty, meaning 1 bit, or is a decimal number of bits ("2", "4", "8", or "16"); and <component-suffix> is either empty, meaning the full register size (XLEN), or is one of the suffixes '.N', '.B', '.H', or '.W'.
It would aid comprehension if the GREVI pseudo-instructions followed the same system. I propose
REV<unit-size><component-suffix>
This would rename all of the GREVI pseudo-instructions as follows:
BREV.P -> REV.P
PSWAP.N -> REV2.N
BREV.N -> REV.N
NSWAP.B -> REV4.B
PSWAP.B -> REV2.B
BREV.B -> REV.B
BSWAP.H -> REV8.H
NSWAP.H -> REV4.H
PSWAP.H -> REV2.H
BREV.H -> REV.H
HSWAP.W -> REV16.W
BSWAP.W -> REV8.W
NSWAP.W -> REV4.W
PSWAP.W -> REV2.W
BREV.W -> REV.W
WSWAP -> REV32
HSWAP -> REV16
BSWAP -> REV8
NSWAP -> REV4
PSWAP -> REV2
BREV -> REV
If the name "BSWAP" is so entrenched that we feel we must have this mnemonic, then BSWAP can be made another pseudo-instruction alias for REV8.
@cliffordwolf
Hi Clifford
for an XLEN=32 or XLEN=64 Implementation, should the following instruction raise an exception ?
sbseti x1, x2, 127
Thx
Lee
Hi,
in the opcode encodings table, FSRI shows that bit 26 needs to be 1. Can we add to the table that for SBEXTI, GORCI, GREVI, RORI, SROI, SRAI, and SRLI, bit 26 needs to be 0? Otherwise it looks like there is overlap in the encodings. I see the text above the table mentions that op[26]=1 selects funnel shifts, but it might be helpful to show this in the table as well.
Thanks,
Dan
As instruction BFP is currently defined, the bit field it overlays in the rs1 value may wrap around to span both the high (most-significant) and low (least-significant) ends of the result. For example, this sequence,
li t0,12<<24|26<<16|0xABCD
bfp t1,zero,t0
for RV32, leaves t1 with the value 0x340002AF, because the 16-bit value 0xABCD shifted left 26 bits (without clipping) is 0x2AF34000000, and this value wraps around from high bits to low bits in the result.
Are there expected advantages to this wrapping? My analysis indicates that the hardware for BFP (and for the B extension generally) can be slightly reduced by not defining BFP to wrap around this way. The basic reason is that BFP requires the hardware to separately create a mask in addition to shifting (or rotating) rs2[15:0], and forcing this mask to wrap around adds a little extra circuitry.
I can imagine some applications might benefit from wrapping around, while others benefit more from not wrapping around. If there are good reasons to prefer wrap-around, I suggest adding an explanation of that choice to the document. If not, I propose modifying the specified behavior to use shifts instead of rotations, like so:
uint_xlen_t bfp(uint_xlen_t rs1, uint_xlen_t rs2)
{
int len = (rs2 >> 24) & 15;
int off = (rs2 >> 16) & (XLEN-1);
len = len ? len : 16;
uint_xlen_t mask = slo(0, len) << off;
uint_xlen_t data = rs2 << off;
return (data & mask) | (rs1 & ~mask);
}
For my example above, the value left in t1
would then be 0x34000000
.
(Note that the hardware could still quietly substitute
uint_xlen_t data = rol(rs2, off);
for computing data without changing the behavior, if that's more convenient. The issue is just with the rotation of the mask.)
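A runnable C version of the proposed non-wrapping definition (RV32, with slo(0, len) expanded inline to a mask of len low ones), which reproduces the example above:

```c
#include <stdint.h>

// Proposed non-wrapping BFP for RV32: the mask is built with a plain shift,
// so bits pushed past the top are discarded instead of wrapping around.
uint32_t bfp_nowrap(uint32_t rs1, uint32_t rs2) {
    int len = (rs2 >> 24) & 15;
    int off = (rs2 >> 16) & 31;          // XLEN-1 for RV32
    len = len ? len : 16;                // len == 0 encodes a 16-bit field
    uint32_t mask = ((1u << len) - 1) << off;  // slo(0, len) << off
    uint32_t data = rs2 << off;
    return (data & mask) | (rs1 & ~mask);
}
```

With rs2 = 12<<24 | 26<<16 | 0xABCD and rs1 = 0, this yields 0x34000000: the bits shifted past bit 31 are simply dropped, rather than reappearing at the low end as 0x2AF.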
This "OP Quadrant" proposal below has global implications for RISC V instruction encoding, but I propose it here in BitManip, as this is the first extension (other than 'M') to need such organisation within func7. (func7 & func3 have the usual meaning for the R-type instruction format).
Note: the RISC V user ISA spec explicitly states that RV128 may introduce new 128 bit instructions into an OP128 major opcode, which is the reverse of what happened for RV64. I assume this will be the case, in discussion below.
"Contiguous" reserved opcode space is a precious resource. RISC V has only three reserved major opcodes left for future standard extensions.
Up to now, the only func7 values used for instructions within the OP-INT family of major opcodes are 0b0000000, 0b0000001 (MUL/DIV), and 0b0100000 (SUB/SRA). Bitmanip will substantially expand the usage of func7 values. It is important that this is done in a rational way, as func7 values chosen within OP will also have major side effects on OP-IMM, OP32[IM], and OP128[IM].
Within OP32 and OP128, up to 50% of these major opcodes are available as contiguous reserved space (for func3 values = 0bX1X (where X = 0 or 1), ie: do not correspond to any "Q" or "W" instruction. Care needs to be taken not to punch "holes" into this space. (Unfortunately, two "M" instructions break this rule in OP32, reducing OP32 continuous free opcode space slightly)
The current BitManip v0.90 encoding proposal are bit problematic in this regard as it punches "holes" into "non-W" sections of OP32. These non-W sections otherwise form part of an unused 50% of OP32/OP128, and scattered holes within them will limit the long term usefulness for other future extensions. (An example of a "hole" created in OP32 is BDEPW, which has a func3 value of 0b010).
BitManip v0.90 also unnecessarily introduces a new two source R-type format specifically for one instruction, FSRI, which moves the rs2 register field to a new position. This will complicate implementation of superscalar out-of-order microarchitectures, and breaks the existing RISC V approach of keeping rs1 and rs2 in the same positions for every relevant instruction.
The choice of 4 x 32-value Quadrants is not arbitrary. It is in fact fundamental to the organisation of RV32, RV64 and RV128.
There is a 32-value func7 constraint for I-type shift instructions with a 7-bit (RV128) immediate field. For RV64, I-type shift instructions have a 6-bit immediate field and can encode 64 values in their remaining instruction bits, hence translating into a 64-value func7 constraint. (Hence Quadrants A and B need to be created for these distinct 2 x 32-value subsets of func7.)
Also, dividing func7 into Quadrants is natural for ternary instructions, as blocks of 32 func7 values are needed to introduce an "rs3" instruction format (hence Quadrant "D" needs to be created for such rs3-type instructions).
Below is an outline of how func7 should be structured into Quadrants A-D, based on the last two bit values of func7 (shown below as ' | 00' to ' | 11' ):
Quadrant A1 (n=1): instructions with func7 = 0b00000 | 00
Quadrant A2 (n=29): instructions with func7 in range 0b00001 | 00 to 0b11101 | 00
Quadrant A3 (n=2, but could grow if needed): instructions with func7 = 0b1111X | 00
Quadrant B (n=33): func7 value in range 0b00000 | 10 to 0b11111 | 10
Quadrant C (n=32): func7 value in range 0b00000 | 01 to 0b11111 | 01
Quadrant D (n=32): func7 value in range 0b00000 | 11 to 0b11111 | 11
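To make the grouping concrete, here is a small C sketch (mine, not part of the proposal) that classifies a func7 value into Quadrants A-D from its two least significant bits; the function and enum names are made up for illustration:

```c
/* Illustrative sketch: classify a func7 value into the Quadrants A-D
 * described above, based on func7's two least significant bits. */
typedef enum { QUAD_A, QUAD_B, QUAD_C, QUAD_D } quadrant_t;

static quadrant_t func7_quadrant(unsigned func7)
{
    switch (func7 & 3) {
    case 0:  return QUAD_A;  /* func7 = 0bxxxxx|00 */
    case 2:  return QUAD_B;  /* func7 = 0bxxxxx|10 */
    case 1:  return QUAD_C;  /* func7 = 0bxxxxx|01 */
    default: return QUAD_D;  /* func7 = 0bxxxxx|11 */
    }
}
```

Note, for example, that the existing MUL/DIV func7 of 0b0000001 lands in Quadrant C, and SUB/SRA's 0b0100000 stays in Quadrant A.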
Below is an example of how the above quadrants can be used to organise the BitManip proposed instructions:
| | func7 | rs2 | rs1 | rd | opcode | func3=000 | 100 | 001 | 101 | 010 | 011 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Group A1 | 00000.00 | rs2 | rs1 | rd | 0110011 | ADD | XOR | SLL | SRL | SLT | SLTU | OR | AND |
| Group A2 | 01000.00 | rs2 | rs1 | rd | 0110011 | SUB | XNOR | SBINV | SRA | ORN | ANDN | | |
| | 00001.00 | rs2 | rs1 | rd | 0110011 | ADDU.W | PACK | SBSET | GREV | MIN | MINU | | |
| | 01001.00 | rs2 | rs1 | rd | 0110011 | SUBU.W | SBCLR | SBEXT | MAX | MAXU | | | |
| | 00010.00 | rs2 | rs1 | rd | 0110011 | ROL | ROR | SLO | SRO | | | | |
| | 01010.00 | rs2 | rs1 | rd | 0110011 | BDEP | BEXT | SHFL | UNSHFL | | | | |
| Group A3 | 11111.00 | rs2 | rs1 | rd | 0110011 | CLMUL | CLMULR | CLMULH | | | | | |
| Group B | xxxxx.10 | | | | | | | | | | | | |
| Group C | 00000.01 | rs2 | rs1 | rd | 0110011 | MUL | DIV | MULH | DIVU | MULHSU | MULHU | REM | REMU |
| Group D | rs3/imm5.11 | rs2 | rs1 | rd | 0110011 | FSLI | FSRI | FSL | FSR | CMOVI | CMOV | CMIX | |
Note 1: OP-IM, OP32 and OP-32IM are not shown, as these are automatically implied by the quadrant in which each instruction is added.
Note 2: RORI is not included, as it can be replaced by FSRI/FSLI; bitmatrix instructions are not shown, as these are RV64-only and best placed in OP32 with func3=0bX1X.
Note 3: Unary instructions, not shown, are placed into OP-IM in the slot occupied by CLMULH (ie: Group A3 with func3=0bX01).
These are the ARM pack instructions:
PKHBT Rd, Rn, Rm ## Rd = Rm[31:16]|Rn[15:0], Bottom of Rn, Top of Rm
PKHTB Rd, Rn, Rm ## Rd = Rn[31:16]|Rm[15:0], Top of Rn, Bottom of Rm
I can easily tell which part is taken from Rn/Rm.
In the pack instruction format:
pack rd, rs1, rs2
the order is kind of the reverse of the order expressed in assembly.
Likewise in the funnel shift:
fsr rd, rs1, rs3, rs2
Is it a little-endian concept?
In a few instances, the current drafts for the B and V extensions have instructions that give different names to the same operation.
The B extension has PCNT, while the V extension has an instruction called VMPOPC, where "POPC" also stands for population count.
The B extension has ANDC, while the V extension has VMANDNOT and VMORNOT.
For the examples above, one or the other extension (or both) must be changed to avoid gratuitous inconsistencies.
Furthermore, the B extension has CMOV, while the V extension has VMERGE that performs the same function element-wise for vectors. (My preference would be to have SELECT and VSELECT, or SEL and VSEL, but if those are impossible, I propose renaming CMOV as MERGE.)
Finally, the B extension has BEXT which operates on bits, while the V extension has VCOMPRESS that performs the same function on vector elements.
One of the frequently asked questions listed in the document is:
Do we really need all the *W opcodes for 32 bit ops on RV64?
In my opinion, the only *W instructions in the B extension that might have significant value are the rotate instructions, RORW, ROLW, RORIW, and maybe PACKW. To help decide, the document proposes running "proper experiments with compilers that support those instructions". That would be ideal of course, but I'm skeptical the community will wait on the B extension long enough for that to happen. (Plus it's not exactly easy to set up unbiased experiments for these kinds of specialized features.)
In the meantime, I'd like to point out that most of the proposed *W instructions can be substituted by a sequence of only 2 or 3 other instructions. The following are believed to be equivalent sequences (not always unique):
CLZW rd,rs
SLOI rd,rs,32
CLZ rd,rd
CTZW rd,rs
LI temp,-1
PACK rd,temp,rs
CTZ rd,rd
PCNTW rd,rs
PACK rd,zero,rs
PCNT rd,rd
SLOIW rd,rs1,i
SLOI rd,rs1,i
SEXT.W rd,rd
SLOW rd,rs1,rs2
ANDI temp,rs2,31
SLO rd,rs1,temp
SEXT.W rd,rd
SROIW rd,rs1,i
If i > 0:
SLLI rd,rs1,32
SROI rd,rd,(i+32)
If i = 0:
SEXT.W rd,rs1
SROW rd,rs1,rs2
NOT temp,rs1
SRLW rd,temp,rs2
NOT rd,rd
GREVIW rd,rs1,i
GREVI rd,rs1,i
SEXT.W rd,rd
GREVW rd,rs1,rs2
ANDI temp,rs2,31
GREV rd,rs1,temp
SEXT.W rd,rd
SHFLIW rd,rs1,i
SHFLI rd,rs1,i
SEXT.W rd,rd
SHFLW rd,rs1,rs2
ANDI temp,rs2,15
SHFL rd,rs1,temp
SEXT.W rd,rd
UNSHFLIW rd,rs1,i
UNSHFLI rd,rs1,i
SEXT.W rd,rd
UNSHFLW rd,rs1,rs2
ANDI temp,rs2,15
UNSHFL rd,rs1,temp
SEXT.W rd,rd
BEXTW rd,rs1,rs2
If rs2 is a known constant, rs2[63:32] = 0, and at least one bit in rs2[31:0] is a zero (very likely):
BEXT rd,rs1,rs2
If rs2 is a known constant, rs2[63:32] != 0 (unlikely), and at least one bit in rs2[31:0] is a zero:
PACK temp,zero,rs2
BEXT rd,rs1,temp
If rs2 is a known constant and rs2[31:0] = 0xFFFFFFFF (unlikely):
SEXT.W rd,rs1
If rs2 is not a known constant:
PACK temp,zero,rs2
BEXT rd,rs1,temp
SEXT.W rd,rd
BDEPW rd,rs1,rs2
If rs2 is a known constant and rs2[63:31] = 0:
BDEP rd,rs1,rs2
Else:
BDEP rd,rs1,rs2
SEXT.W rd,rd
CLMULW rd,rs1,rs2
CLMUL rd,rs1,rs2
SEXT.W rd,rd
FSLW rd,rs1,rs2,rs3
PACK rd,rs3,rs1
FSL rd,rd,rs2,rd
SEXT.W rd,rd
FSRW rd,rs1,rs2,rs3
PACK rd,rs1,rs3
FSR rd,rd,rs2,rd
SEXT.W rd,rd
Note that a final SEXT.W rd,rd can be eliminated if the rd result is known to be used only in subsequent 32-bit operations (such as SW or other *W instructions). Other optimizing tweaks are also possible, depending on the circumstance.
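As a sanity check, here is a quick C model (my own sketch, following the reference semantics of CLZ and SLO from the draft) confirming the first substitution above, CLZW ≡ SLOI rd,rs,32 ; CLZ rd,rd, on RV64:

```c
#include <stdint.h>

/* Reference-style models for RV64 (XLEN = 64). */
static int clz64(uint64_t x)            /* count leading zeros; clz64(0) = 64 */
{
    int n = 0;
    for (uint64_t bit = UINT64_C(1) << 63; bit && !(x & bit); bit >>= 1)
        n++;
    return n;
}

static uint64_t sloi(uint64_t x, int shamt)  /* shift left, ones shifted in */
{
    return ~(~x << shamt);
}

static int clzw(uint64_t x)             /* leading zeros of the low 32 bits */
{
    uint32_t w = (uint32_t)x;
    int n = 0;
    for (uint32_t bit = UINT32_C(1) << 31; bit && !(w & bit); bit >>= 1)
        n++;
    return n;
}
```

For any rs, clzw(rs) equals clz64(sloi(rs, 32)): the SLOI moves the low word into the upper half and fills the bottom with ones, so the CLZ can never run past bit 32.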
Unless the proposed *W instructions can be shown to be much more prevalent than I expect, the combination of rare utility plus relatively easy synthesis from other instructions argues strongly for dropping them.
The document also says:
But they add very little complexity to the core. So the only question is if it is worth the encoding space.
While "very little complexity" may be true, I disagree that it should be dismissed and only encoding space considered. "Very little complexity" certainly lowers the threshold of utility an instruction must demonstrate to be acceptable, but it doesn't make the instruction free to add. There are many other possible instructions of very little complexity that we so far choose to exclude, and these *W instructions perhaps should be among them.
For instance, there's a whole category of instructions that would have more impact but aren't currently included: the unsigned equivalents of the existing RV64I *W instructions, such as ADDWU, SUBWU, SLLWU, etc. These would be just like the existing *W instructions but would instead zero the upper 32 bits, as appropriate for an unsigned int or uint32_t result type rather than int or int32_t. We hardly need to run any experiments to know that such *WU instructions would be used far more frequently than the *W instructions proposed for the B extension.
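To make the semantic difference concrete, here is a sketch of what a hypothetical ADDWU would compute next to the existing ADDW (the *WU instructions are only a suggestion here; the names and semantics below are my assumption):

```c
#include <stdint.h>

/* ADDW: 32-bit add, result sign-extended to 64 bits (existing RV64I). */
static uint64_t addw(uint64_t rs1, uint64_t rs2)
{
    int32_t sum = (int32_t)((uint32_t)rs1 + (uint32_t)rs2);
    return (uint64_t)(int64_t)sum;       /* replicate bit 31 upward */
}

/* ADDWU: 32-bit add, result zero-extended to 64 bits (hypothetical). */
static uint64_t addwu(uint64_t rs1, uint64_t rs2)
{
    return (uint64_t)((uint32_t)rs1 + (uint32_t)rs2);
}
```

With rs1 = 0x7fffffff and rs2 = 1, ADDW yields 0xffffffff80000000 while ADDWU yields 0x0000000080000000, matching uint32_t arithmetic.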
Are the sbsetiw, sbclriw, and sbinviw encodings with imm[5] != 0 reserved?
I am not sure that hard-coding the polynomials for CRC is a good idea. It seems pretty constrained to add a CRC instruction but support only 2 polynomials. Granted, having 0xedb88320 (what I call the IEEE polynomial) and the Castagnoli polynomial 0x82f63b78 does cover a large number of uses.
However, here is a list of polynomials from Philip Koopman, for instance, and people may use different ones depending on the application:
https://users.ece.cmu.edu/~koopman/crc/index.html
Honestly, I think using funct7 | rs2 | rs1 | f3 | rd | opcode (R-type) would be better than the unary format. This would allow one to load polynomials from rs2. It would also need only 2 bits in funct7 for B, H, W, and D, and a single bit in funct7 could select between 0xedb88320 and 0x82f63b78. There are two ways I could see handling the predefined polynomials.
For instance, if we wanted to default to the IEEE polynomial, we can take advantage of x0, the zero register, by changing the pseudocode to this. This would also allow anyone to load any other polynomial from a register, but it must be XOR'ed with 0xedb88320 beforehand. For instance, if I wanted to use the polynomial 0xeb31d82e, XOR'ing gives me 0x06895b0e, which I can then use as the constant I load into the rs2 register.
uint_xlen_t crc32(uint_xlen_t rs1, uint_xlen_t rs2, int nbits) {
for (int i = 0; i < nbits; i++)
rs1 = (rs1 >> 1) ^ ( ( 0xEDB88320 ^ rs2 ) & ~((rs1 & 1) - 1));
return rs1;
}
The other option would be to use a predefined polynomial if rs2 is the zero register. This avoids XOR'ing the desired polynomial. Although, I am not sure XOR'ing is a problem, since it can easily be done once, outside of run time, for a given polynomial.
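Here is the pseudocode above as self-contained C, together with a check of the XOR trick (my sketch; crc32_step is just a name for the proposed operation):

```c
#include <stdint.h>

/* Proposed CRC step: rs2 holds the XOR-delta from the default IEEE
 * polynomial 0xEDB88320, so rs2 = x0 = 0 gives the default polynomial. */
static uint64_t crc32_step(uint64_t rs1, uint64_t rs2, int nbits)
{
    for (int i = 0; i < nbits; i++)
        rs1 = (rs1 >> 1) ^ ((UINT64_C(0xEDB88320) ^ rs2) & ~((rs1 & 1) - 1));
    return rs1;
}
```

To run with the polynomial 0xEB31D82E, preload rs2 with 0xEB31D82E ^ 0xEDB88320 = 0x06895B0E; with rs2 = 0, eight steps per byte reproduce the standard reflected CRC-32 update.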
With the current draft of the B extension, RISC-V has single instructions that can be used to zero-extend a byte (ANDI), halfword (PACKW), or word (PACK or ADDIWU) to full 64-bit register width. A word can also be sign-extended in one instruction (ADDIW). But, unless I'm mistaken, there is still no single instruction that can sign-extend a byte or halfword to 64 bits. If there are constituencies out there that make frequent use of signed char and short types (embedded applications with limited memory, perhaps?), such instructions might get more use overall than others that are being included.
Similarly, for reading big-endian data, we currently have the ability in only two instructions to load and byte-swap an unsigned halfword (LHU + BSWAP.H), an unsigned word (LWU + BSWAP.W), or a signed word (LWU + GREVIW), but not a signed halfword, which takes three instructions.
So I was looking at the encodings; a lot are not fully described, but this one seems obvious to me.
| ??????? | rs2 | rs1 | ??? | rd | 0110011 | ANDC
Here is the current AND:
| 0000000 | rs2 | rs1 | 111 | rd | 0110011 | AND
Why not define ANDC much like the SRL/SRA pair? The ADD-to-SUB relationship is very similar. So using this bit to denote negation seems pretty intuitive.
| 0100000 | rs2 | rs1 | 111 | rd | 0110011 | ANDC
While probably not as useful as ANDC, it's also logically easy to extend complemented inputs to the other bitwise instructions like this:
| 0100000 | rs2 | rs1 | 100 | rd | 0110011 | XORC
| 0100000 | rs2 | rs1 | 110 | rd | 0110011 | ORC
Also, when it comes to hardware implementation, negation is used when converting a positive number to a negative number (2's complement). Therefore, it would be easy to reuse the same negation hardware if the same bit were used to denote negation. Keeping func3 the same also means an ALU only has to worry about choosing whether to negate an input.
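For reference, here are the semantics being discussed as I understand them, assuming the second source operand is the complemented one (ANDC computes rs1 & ~rs2 in the draft; XORC and ORC are the hypothetical extensions proposed above):

```c
#include <stdint.h>

/* rs2 is the complemented operand (my assumption for XORC/ORC). */
static uint64_t andc(uint64_t rs1, uint64_t rs2) { return rs1 & ~rs2; }
static uint64_t xorc(uint64_t rs1, uint64_t rs2) { return rs1 ^ ~rs2; }
static uint64_t orc (uint64_t rs1, uint64_t rs2) { return rs1 | ~rs2; }
```

Note that xorc(a, b) is just ~(a ^ b), i.e. XNOR, regardless of which operand is complemented.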
In https://groups.google.com/d/msg/comp.arch/8MR8_O-wCeE/8pyiGYz8AQAJ Pedro Pereira suggests:
In the latest bitmanip extension document, the popcount opcode is defined as:
rd = pcnt(rs)
a more useful primitive would be:
rd = pcnt(rs1 ^ rs2)
Since RISC-V has a zero register (x0), the suggested version could encode the first one as "rd = pcnt(rs ^ x0)". I don't imagine that reading one extra register and performing a xor would make the instruction need more cycles.
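A quick C illustration of why the fused form is attractive: pcnt(rs1 ^ rs2) is the Hamming distance between two registers, and rs2 = x0 degenerates to a plain population count (popcount64 below stands in for the hardware pcnt):

```c
#include <stdint.h>

static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }   /* clear lowest set bit */
    return n;
}

static int hamming_distance(uint64_t rs1, uint64_t rs2)
{
    return popcount64(rs1 ^ rs2);    /* the suggested fused primitive */
}
```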
I propose a compatible, useful, and low cost enhancement to the GREV/GREVI Generalized Reverse instructions.
At present each stage of GREV can either swap each pair of bits or else propagate them unchanged, as determined by the SHAMT bit for that stage.
I propose to perform some other function on each pair that would normally be swapped, with the same function being substituted for "swap" at each stage. The function to be performed is specified by one or more currently unused bits in rs2 or imm e.g. bit 6, or bits 6-7, or perhaps higher numbered bits to allow for 128 bit CPUs.
Supposing two bits are used, the encoding might be:
00: swap the two bits
01: both outputs are the OR of the two input bits
10: both outputs are the AND of the two input bits
11: I don't have a candidate. XOR isn't useful.
Due to De Morgan's laws it is not necessary to provide both AND and OR, so if a useful 4th function can't be thought of, then perhaps only one bit should be used.
EFFECT
I anticipate that OR or AND processing would normally be added to grev.w, grev.h, grev.b, grev.n, or grev.p. I have not evaluated whether use with other mask values is useful.
When used with one of the above masks the effect of OR instead of SWAP is to set the entire field to 1s if any bit in the field is 1. Fields consisting entirely of 0s remain as 0s. The effect of AND is to set the entire field to 0s if any bit in the field is 0. Fields consisting entirely of 1s remain as 1s.
For example, with an input of cbf20097147200ac the output of GREV.B.OR is ffff00ffffff00ff.
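The example can be checked with a small C model of the proposal (my sketch: the standard grev butterfly stages, with each selected stage OR-ing the pair instead of swapping it):

```c
#include <stdint.h>

/* OR variant of GREV for RV64. With shamt = 7 (the grev.b stages:
 * 1, 2, 4) any set bit spreads to its whole byte. */
static uint64_t grev_or(uint64_t x, int shamt)
{
    if (shamt & 1)
        x |= ((x & 0x5555555555555555ULL) << 1) | ((x & 0xAAAAAAAAAAAAAAAAULL) >> 1);
    if (shamt & 2)
        x |= ((x & 0x3333333333333333ULL) << 2) | ((x & 0xCCCCCCCCCCCCCCCCULL) >> 2);
    if (shamt & 4)
        x |= ((x & 0x0F0F0F0F0F0F0F0FULL) << 4) | ((x & 0xF0F0F0F0F0F0F0F0ULL) >> 4);
    if (shamt & 8)
        x |= ((x & 0x00FF00FF00FF00FFULL) << 8) | ((x & 0xFF00FF00FF00FF00ULL) >> 8);
    if (shamt & 16)
        x |= ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x & 0xFFFF0000FFFF0000ULL) >> 16);
    if (shamt & 32)
        x |= ((x & 0x00000000FFFFFFFFULL) << 32) | ((x & 0xFFFFFFFF00000000ULL) >> 32);
    return x;
}
```

grev_or(0xcbf20097147200ac, 7) indeed produces 0xffff00ffffff00ff, and a shamt of 0 leaves the input unchanged, matching the existing grev semantics for the degenerate case.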
APPLICATIONS
If the output of the above GREV.B.OR is inverted to 0000ff000000ff00 then CLZ or CTZ can be used to determine the position of the first or last zero byte in the input value.
Alternatively, the input value could be inverted and then GREV.B.AND will produce the necessary input for CLZ or CTZ.
This is very valuable in efficient implementation of C string processing functions such as strlen(), strcpy(), strcmp().
Along with general benefits, this will provide a large boost to RISC-V scores in Dhrystone.
Using GREV.H.OR provides the same functionality for UTF-16 or UCS-2, or GREV.W.OR for UTF-32 or UCS-4.
COST
GREV is dominated by wire cost. The logic at each node is extremely small and increasing its size will not meaningfully impact the cost of GREV in either SoC or FPGA.
In particular, in FPGAs with splittable 6-LUTs we have five inputs (the two input bits, the SHAMT swap/fn enable for the stage, and my proposed two function select bits) and these inputs determine two independent bit outputs -- a perfect fit.
The attached program provides a reference C implementation of the proposed modification, checks that it produces the same output as the existing reference implementation when the function select bits are 0, and demonstrates finding all-zero fields of widths 4 to 32.
Please, please, please, in the assembly language for the ternary instructions, do not place the control operand of CMIX between the other two source operands, do not place the condition operand of CMOV between the other two source operands, and do not place the shift amount for funnel shifts (FSL and FSR) between the other two source operands.
I understand the hardware motivation for having the control and shift amounts be in rs2, which forces the other two operands to be rs1 and rs3. But it would be better to define the assembly language for these instructions as
CMIX rd,rs2,rs1,rs3
CMOV rd,rs2,rs1,rs3
FSL rd,rs1,rs3,rs2
FSR rd,rs1,rs3,rs2
Whatever extra trouble a nonlinear operand order might cause for tools authors, it is nothing compared to the multiplicative effect of foisting an illogical order on programmers. Let's not forget, there are literally thousands of programmers for every tools author, and we'd prefer as often as possible to help those programmers write bug-free code.
I note that assembly language pseudo-instructions already provide some precedent for breaking a definite connection between operand order and source register numbers. Store instructions are another existing exceptional case, being written as
SW rs2,offset(rs1)
and not
SW rs1,rs2,offset
Hi all,
I don't understand the encoding for the SLLI, SRLI, ... ROR family of instructions.
Bitmanip v0.9 spec, page 35:
The aforementioned SLLI* instructions are not defined this way in the spec. There is a 6-bit immediate field, not 7, as shown below for rv64i (page 30 of ISA spec v2.2).
Moreover, the assembler code treats the first two fields like a funct6, followed by a 6-bit immediate:
... opcode/riscv.h:
#define OP_MASK_SHAMT 0x3f
#define OP_SH_SHAMT 20
... riscv-opc.c:
#define USE_BITS(mask,shift) (used_bits |= ((insn_t)(mask) << (shift)))
...
case '>': USE_BITS (OP_MASK_SHAMT, OP_SH_SHAMT); break;
Could someone please explain what's going on here? And if this is correct, should it not have its own encoding format?
Is there any way to build the Linux toolchain with B support?
I took a look at the scripts, and it looks like you could add a compile for the Linux tools(??)
I tried adding gcc optimization support for the b extension. This is one day of work, so I only added the easy ones, didn't verify the results with execution, and haven't tried to handle every case. The assembler is missing support for the rev and zext aliases but I can emit the pack and grevi instructions for those. The assembler is missing support for the addwu, subwu, addu.w, subu.w, and slliu.w instructions, so those are disabled though I am able to generate them.
This patch doesn't affect dhrystone, but for coremark I see a 280 byte reduction in size, which is about 1.5%, with 99 pack instructions and 3 max instructions. Then I realized I had the signed ee_u32 hack in my tree, so I tried undoing that. Now I see a 384 byte reduction in size, which is about 2%, with 190 pack instructions, 3 max instructions, and 1 maxu instruction. We can perhaps get better results with support for the missing addwu etc instructions.
Hi,
I got below output when using spike to run B extension.
core 0: 0xffffffff80001742 (0x60191b93) ctz (args unknown)
core 0: 0xffffffff80001746 (0x61a01a33) rol (args unknown)
core 0: 0xffffffff8000174a (0x003199a3) sh gp, 19(gp)
core 0: 0xffffffff8000174e (0x406eeeb3) orn (args unknown)
core 0: 0xffffffff80001752 (0x41f6fbb3) andn (args unknown)
core 0: 0xffffffff80001756 (0x0aeef133) maxu (args unknown)
Here is my command.
spike --isa=rv32imcb -l test.o
Any suggestion? Thanks
@cliffordwolf
Hi Clifford,
I wonder if you could clarify something.
In my model containing the B extensions, I am getting a decode conflict reported for FSRI during our static analysis phase, could you clarify this for me?
Here are the decode definitions
DECODE_ENTRY(0, RORI, "|01100.......|.....|101|.....|0010011|");
DECODE_ENTRY(0, SBEXTI, "|01001.......|.....|101|.....|0010011|");
DECODE_ENTRY(0, SROI, "|00100.......|.....|101|.....|0010011|");
DECODE_ENTRY(0, FSRI, "|.....1......|.....|101|.....|0010011|");
a '.' indicates a wildcard, so as you can see, FSRI overlaps with RORI, SBEXTI and SROI:
FSRI[26] is part of the immediate value in the RORI, SBEXTI and SROI instructions, and
FSRI[31:27] is rs3, but part of the decode in RORI, SBEXTI and SROI.
What are your thoughts?
could this be a documentation error, and FSRI should be
DECODE_ENTRY(0, FSRI, "|.....1......|.....|101|.....|0110011|");
not
DECODE_ENTRY(0, FSRI, "|.....1......|.....|101|.....|0010011|");
Thx
Lee