Programming with RISC-V Vector Instructions (gms.tf)
filereaper 1350 days ago [-]
Wow, this RISC-V ISA truly is quite brilliant.

Having worked on PPC with a RISC instruction set, the difference between separate 128-bit and 256-bit vector instructions really does eat up the limited opcode space for trivial differences.

Also, having written, say, one fast version of a vector array copy, that same code can now be used across different vector register lengths; there's no need to write separate versions to exploit expanded width, and the same goes for many vector compiler optimizations.

How does this work with physical registers as opposed to architectural ones? Typically in PPC the 128-bit and 256-bit registers were architecturally and physically non-overlapping, so you did get extra registers when you went from 128 to 256 or 512 bits. I don't know if that's the case for RISC-V here.

But yeah, brilliant; looking forward to more!

brigade 1350 days ago [-]
The 32 variable length registers don’t overlap; the register grouping is orthogonal to the length of each register.

The grouping... sounds “interesting” to implement in an OoOE design. The most obvious approach would be to have the instruction decoder emit one uop per register in the group... but that means vsetvli would have to stall decoding until it’s resolved. But that also seems to be how element size is set, so that would kill the performance of mixing precisions in the same kernel...

Well I guess it could assume the grouping doesn’t change and flush the pipeline if it did. But you still don’t want to be mixing kernels with different groupings...

brucehoult 1349 days ago [-]
It's absolutely essential to be able to change vsetvli in the middle of a kernel.

It is expected that in a future expanded Vector instruction set with 48-bit or 64-bit opcodes, the vtype will be explicitly encoded in every instruction and can change with every instruction.

Right now the vsetvli is setting (slightly) persistent state that affects the following instructions. You're allowed to put one before every vector instruction if you want, without significant execution penalty -- there will be a little, from the extra instruction fetch and decode, but it's similar to doing, say, an integer add between each vector instruction.

The natural implementation even now is to have each vector instruction pick up the current vtype when it is decoded and carry it along with it through the pipeline as a few extra bits of opcode.

You certainly don't want to have any stalls or pipeline flushes just because the vtype changes.
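
For concreteness, a minimal sketch (the registers and loads here are illustrative, not from the article) of changing vtype between back-to-back vector instructions:

  vsetvli t0, a0, e8, m1    # a0 elements, 8-bit SEW
  vle8.v  v8, (a1)          # load bytes
  vsetvli t0, a0, e16, m2   # same element count, now 16-bit SEW
  vle16.v v16, (a2)         # load halfwords; no flush should be needed in between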

lukeplato 1350 days ago [-]
I highly recommend Lex Fridman's recent podcast with David Patterson [1] for anyone interested in learning about the history of RISC, computer architecture, and also interesting predictions re: Moore's Law.

[1] https://www.youtube.com/watch?v=naed4C4hfAg

asdajsdh 1350 days ago [-]
Only about a year ago I learned about Patterson, because I got a (random) book about computer architectures out of my uni library. His books are fascinating. He is fascinating.
twic 1350 days ago [-]
Irrelevant, I know, but:

> For the purpose of our example, the exercise is to write vector code that efficiently converts a BCD string such as { 0x12, 0x34, ..., 0xcd, 0xef } to a corresponding ASCII string (e.g. { '1', '2', '3', '4', ..., 'c', 'd', 'e', 'f' }). On a high-level, a solution involves separating the nibbles into single bytes and then converting each byte to the matching ASCII value.

If your BCD string has 0xcd or 0xef in it, it's not BCD, is it? It's "binary coded hexadecimal", or as we usually call it, "binary".

This code converts a byte string to its hex representation. It has nothing to do with BCD, right?

microcolonel 1349 days ago [-]
> This code converts a byte string to its hex representation. It has nothing to do with BCD, right?

BCD to ASCII is a strict subset of bin to hex ASCII; and in this case there is no runtime cost to supporting both. This also covers nybble-coded octal.

ghusbands 1349 days ago [-]
Describing something that produces a hex string from binary as "BCD to ASCII" is unusual and misleading. And claiming that 0xcd and 0xef are BCD is simply incorrect.
brucehoult 1349 days ago [-]
This is a reasonably good example (except it's binary to hex, not just BCD to hex), but it's from the start of the year and based on the already out of date version 0.8 draft spec.

A couple of things need to be changed to bring it up to date:

  -vlbu.v v16, (a1)
  +vle8.v v16, (a1)
  +vzext.vf2 v16, v16

  -vsb.v v24, (a0)
  +vse8.v v24, (a0)
I'd also probably make (or at least compare the speed of) one more change:

  -vrgather.vv v24, v8, v16
  +vmsgtu.vi v0, v16, 9 # set mask-bit if >9
  +vadd.vi v16, v16, '0' # add '0' to each element
  +vadd.vi v16, v16, 'a'-0xA-'0', v0.t # masked add to correct A..F
That's basically the same code as he used to create the lookup table in v8 for the vrgather. I think it might run faster on many machines, and also the vrgather would fail on the smallest machines (with only 32 bits in each vector register) if anyone modified the code to not use m8 (LMUL=8).

For more explanation see my post at https://www.reddit.com/r/RISCV/comments/i5alno/programming_w...

[NB slight cheat -- those vadd.vi immediates are too big to fit .. you'd actually need to put them in integer registers and use vadd.vx, which can be set up outside the loop]
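
A sketch of that vadd.vx variant (the register choices are purely illustrative):

  # outside the loop:
  li t1, 0x30                    # '0'
  li t2, 0x27                    # 'a' - 0xA - '0'
  # inside the loop:
  vmsgtu.vi v0, v16, 9           # set mask-bit if element > 9
  vadd.vx   v16, v16, t1         # add '0' to each element
  vadd.vx   v16, v16, t2, v0.t   # masked add to correct a..f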

castratikron 1349 days ago [-]
Is there a big penalty for context switching to another process that uses a different vector length?
vlmutolo 1349 days ago [-]
It’s really funny that you asked this because I clicked on the RISC-V “V” extension spec on GitHub, and literally the only thing I read was “you can’t context switch to another CPU with different vector lengths”.

EDIT: Non-layman wording

> Thread contexts with active vector state cannot be migrated during execution between harts that have any difference in VLEN or ELEN parameters.

https://github.com/riscv/riscv-v-spec/blob/master/v-spec.ado...

castratikron 1349 days ago [-]
Very cool, thanks.
lachlan-sneff 1350 days ago [-]
Whoa, that's a well-designed ISA.
renox 1349 days ago [-]
The vector extension, yes; the C (compressed) extension is unusual: you can have 32-bit instructions aligned on 16-bit boundaries. While that's nice for code density, it means the implementation is much more complex than the Thumb/MIPS16 extensions..
rwmj 1349 days ago [-]
RISC-V has a variable-length instruction encoding. It's just that, unlike x86, you can easily tell from parsing a few bits the length of every instruction in the stream, and like MIPS etc. most "ordinary" instructions are 32 bits.

BTW if unaligned 32-bit instructions are a concern, there is a Compressed NOP (C.NOP == addi x0, x0, 0, but without RAW hazards).
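
For illustration, a minimal sketch (the labels and registers are assumed, not from the comment) of classifying an instruction's length from the low bits of its first 16-bit parcel, per the base length encoding:

  lhu   t0, 0(a0)          # first parcel of the instruction at (a0)
  andi  t1, t0, 0x3        # bits [1:0]
  li    t2, 0x3
  bne   t1, t2, len16      # [1:0] != 11  -> 16-bit compressed instruction
  andi  t1, t0, 0x1c       # bits [4:2]
  li    t2, 0x1c
  bne   t1, t2, len32      # [4:2] != 111 -> 32-bit instruction
  # longer (48/64-bit) encodings continue from here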

renox 1349 days ago [-]
C.NOP isn't a solution; if you implement a compliant RISC-V processor with the C extension you need to handle all the possible cases, for example an instruction straddling a page boundary.
brucehoult 1349 days ago [-]
Thumb2 -- which is the only ISA available on 32 bit Cortex M devices and the main ISA for a decade now on 32 bit Cortex A devices -- mixes 16 bit and 32 bit opcodes arbitrarily, with 32 bit opcodes frequently on 16 bit boundaries.

It's simply Not That Hard to deal with. You just need to have two 32 bit words in your instruction decode buffer. Sometimes you need the 1st half of the 2nd word and sometimes you don't.

Incidentally, once you've done that, arbitrarily aligned (on halfwords) 48 bit instructions don't need anything extra.

garmaine 1350 days ago [-]
Borrowed from the best.
AnimalMuppet 1350 days ago [-]
For those of us not in the know, from whom were they borrowed?
eddyb 1350 days ago [-]
The RISC-V Vector ISA is explicitly credited by the RISC-V authors to Cray-style "vector processors" [1].

Additionally, I believe experimenting with vector ISAs was mentioned as one of the reasons they started another RISC research project, which ended up being RISC-V.

[1] https://en.wikipedia.org/wiki/Vector_processor

CalChris 1350 days ago [-]
Yeah, RISC should have been called CISC, Cray Instruction Set Computer.
garmaine 1350 days ago [-]
Well it is the fifth RISC ISA from the people who invented RISC.
agumonkey 1349 days ago [-]
CRRRRRISC then
mtgx 1350 days ago [-]
This PDF might help; it explains the differences from other ISAs:

https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.p...

guerrilla 1350 days ago [-]
Would something like this be possible for FPUs as well? I see there are currently three separate extensions for floating point instructions varying by register widths.
Taniwha 1350 days ago [-]
The vector extension includes FP instructions too
rhn_mk1 1350 days ago [-]
What's the deal with the group sizes here?

  vsetvli t0, a6, e8, m8    # switch to 8 bit element size,
                            # i.e. 4 groups of 8 registers

  vmsgtu.vi v0, v8, 9       # set mask-bit if greater than unsigned immediate
                            # --> v0 = | 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 |

If vsetvli results in groups of 8 registers, then surely vmsgtu.vi only affects v0, which names the first 8 registers? The following 8 are in v8, if I understood the previous writing correctly.

saagarjha 1350 days ago [-]
As I understand it, it affects the whole “group” at v0, and it’s only storing 16 8-bit elements, which likely fit within the full 8 registers you have grouped together. (That is, it’s not storing one element per register. I’m not sure exactly what the layout is, but it probably packs them along the lines of: if the register size were 64 bits, v8 would be 0x0706050403020100, v9 would be 0x0f0e0d0c0b0a0908, and then v0 would be 0x0101010101010101 and v1 would be 0x0000000000000101.)

(What I personally don’t understand is the point of these register groupings; they seem a bit extraneous and error-prone, as you can already set the element size and you have to guess at the minimum vector register size… and why wouldn’t you always just set them to 8? I think ARM’s SVE does something similar, but it essentially fixes the size for you.)

nybble41 1349 days ago [-]
AIUI the group size is a trade-off between the number of independently addressable vectors and the vector size. If the group size were always set to eight then you could only have four distinct vectors, whereas a group size of one would give you 32 vectors. You want to use the largest group size you can to take full advantage of the hardware, but you're limited by the number of vectors required by your algorithm. This is orthogonal to the size of an element within each vector.
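
A small sketch of that trade-off (the specific registers and addresses are just for illustration):

  vsetvli t0, a0, e8, m8   # LMUL=8: only v0, v8, v16, v24 are legal group names
  vle8.v  v8, (a1)         # v8 names the whole group v8..v15
  vsetvli t0, a0, e8, m1   # LMUL=1: all of v0..v31 are individually usable
  vle8.v  v9, (a1)         # fine now; with m8 the v9 specifier would be reserved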

The current RISC-V "V" draft standard requires the vector registers to be at least 32 bits wide (VLEN ≥ SLEN ≥ 32)[1], so a 128-bit vector with 16 8-bit elements may require up to four registers. Setting the group size to eight is a bit extravagant, but since the total number of elements was limited to 16 any extra registers in the group will not be affected. With a smaller group size you could potentially use those registers for something else.

P.S. I think you have the mask bits reversed. The instruction is "set if greater than" so v1 should be 0x0101010101010000 and v0 should be 0x0000000000000000 (corresponding to the |1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0| state shown in the article for the combined v1:v0 register group).

[1] https://github.com/riscv/riscv-v-spec/releases/tag/0.9

anonymousDan 1350 days ago [-]
Pretty cool. For RISC-V experts out there, can someone explain to me the purpose of the proxy kernel? I can't seem to wrap my head around it. Why not just run a normal kernel (e.g. Linux) on top of the emulator? What advantages/disadvantages does the proxy kernel have?
seldridge 1350 days ago [-]
It gives you a hardware testing environment with system calls for reasonable amounts of wall clock time.

If you're simulating actual hardware, e.g., a Verilog description of a RISC-V microprocessor compiled to a cycle-accurate simulation with Verilator, your simulation rate is going to be ~10KHz. You can write useful tests with the Proxy Kernel (or something like it) that run in ~1 million instructions (minutes of wall clock time) while still getting full system calls like printf. However, booting Linux is out of the question (days of wall clock time). Running bare metal is useful, too, but you don't have system calls there.

If you're doing fast RISC-V virtualization, like on QEMU, or doing emulation on an FPGA, you're running at >1MHz and running a "normal kernel" like Linux is totally tractable. However, it would be foolhardy to expect to jump from hardware design to immediately booting Linux.

fulafel 1349 days ago [-]
Nitpick: printf is a libc call.
brucehoult 1349 days ago [-]
... and that libc code (usually NewLib[nano]) formats a buffer then calls write(2), which pk provides.
brucehoult 1349 days ago [-]
As well as very slow Verilog RTL emulation, pk is also used to let you run standard user-mode RISC-V binaries with many kinds of system calls in them on:

- software emulators

- cores implemented in an FPGA

- prototype chips

Whatever it is, you only have to implement the CPU core, some RAM, and some sort of two-way communications channel -- whether a pipe, UART, USB serial, Ethernet, or WiFi.

On the other end of the communications channel you run riscv-fesvr (Front End SerVeR).

pk traps system calls, serializes the arguments, and sends them to fesvr. fesvr unpacks the arguments, makes the system call on the host Linux machine, serializes the results, and sends them back to the RISC-V core running in that FPGA or prototype chip or Verilator or whatever.

So your test programs get to use not only printf() but can also navigate the host filesystem, open files, get the time, and so on.

A proper Linux Kernel on the system under test would require megabytes of RAM, various I/O devices etc.

pk requires only a few KB of RAM and a communications channel.
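
To make that concrete, a hypothetical bare-metal sketch of what a write() boils down to under pk: an ordinary ecall, which pk traps and forwards to the host (syscall numbers follow the Linux RISC-V convention that pk uses):

  .globl _start
  _start:
    li a7, 64            # write
    li a0, 1             # fd = stdout
    la a1, msg           # buffer
    li a2, 6             # length
    ecall                # trapped by pk, shipped to fesvr on the host
    li a7, 93            # exit
    li a0, 0
    ecall
  msg: .ascii "hello\n"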

monocasa 1350 days ago [-]
It proxies most everything through HTIF to your host at the syscall layer, so you just read(2), etc., against your host FS. It's also dead simple, so you don't really have to wait for it to boot.