Discussion RISC V Latest Developments Discussion [No Politics]


DisEnchantment

Golden Member
Mar 3, 2017
1,746
6,587
136
Some background on my experience with RISC-V...
Five years ago, we were developing a CI/CD pipeline for an arm64 SoC in the cloud, and we added tests that executed the binaries there as well.
We initially used real HW instances based on an ARM server chip of that era. Unfortunately, the vendor quickly dumped us and exited the market, leaving us with a fair amount of frustration.
We shifted the work to Qemu, which turned out to be functionally as good as the actual chips, but the emulation was buggy and slow, and in the end we settled on qemu-user-static docker images, which work quite well for us. We were running the arm64 Ubuntu cloud images of the time before moving on to docker multi-arch qemu images.

Lately, we have been approached by many vendors with upcoming RISC-V chips, and out of curiosity I revisited the topic above.
To my pleasant surprise, running RISC-V under Qemu is smooth as butter. Emulation is fast, and images from Debian, Ubuntu and Fedora are available out of the box.
I was running Ubuntu cloud images problem-free. Granted, it was headless, but I guess with the likes of Imagination Tech offering up their IP for integration, it is only a matter of time.

What is even more interesting is that Yocto/OpenEmbedded already has a meta layer for RISC-V, and apparently T-Head already got the kernel packages and manifest for Android 10 working with RISC-V.
Very, very impressive for a CPU in such a short span of time. What's more, I see active LLVM, GCC and kernel development happening.

I saw this slide at one of the latest conferences, and I can't help but think that it looks like they are eating somebody's lunch, starting from MCUs and moving up to Application Processors.
1652093521458.png

And based on many developments around the world, this trend seems to be accelerating greatly.
Many high-profile national and multinational (e.g. EU's EPI) projects with RISC-V are popping up left and right.
Intel is now a premium member of the consortium, along with the likes of Google, Alibaba, Huawei, etc.
NVDA and soon AMD seem to be doing RISC-V in their GPUs. Xilinx, Infineon, Siemens, Microchip, ST, AD, Renesas, etc. already have products in the pipeline or already launched.
It will only be a matter of time before all these companies start replacing their proprietary architectures with something from RISC-V. Tool support, compiler, debugger, OS, etc. are taken care of by the community.
Interesting as well is that there are lots of performant RISC-V implementations on GitHub: XuanTie C910 from T-Head/Alibaba, SweRV from WD, and many more.
The embedded industry has already replaced a ton of traditional MCUs with RISC-V ones. AI-tailored CPUs from Jim Keller's Tenstorrent also seem to be in the spotlight.

Most importantly, a bunch of specs got ratified at the end of last year, mainly accelerated by developments around the world. Interesting times.
 

naukkis

Senior member
Jun 5, 2002
853
726
136
RVV is a vector ISA. Its implementation is really different from that of SIMD CPUs. Yes, a vector ISA CPU does not need to run OoO; there is enough parallelism to implement the hardware in-order. OoO hardware, on the other hand, needs to implement those permute possibilities to be able to keep data in registers. RVV is a pure vector machine and can support the same binaries on different execution widths, unlike SVE, which is a stupid halfway implementation between vector and SIMD.
 

naukkis

Senior member
Jun 5, 2002
853
726
136
I think I already made lists about the shortcomings.

For instance the lack of register offset and shift in addressing modes. Arm and x86 have had that since the beginning as it matches pointer access used in HLL.

If that's found to be too time critical, then split it. But when you don't have it you're bound to emit several instructions to get the same behavior which kills code density and forces you to fuse instructions to get to higher performance levels. That is such a failure that some companies have started adding that as a private extension, and IIRC even R-V has an extension to partially alleviate the problem.

That's not true. RV code density is fine, and nobody is forced to use instruction fusion. It is only a problem when trying to convert an Arm hardware design to RV: existing hardware implementations are useless without supporting them with extensions or trying to fuse instructions to look Arm-like. But there are already many well-working RV cores showing that the RV philosophy is fine.
 

Nothingness

Diamond Member
Jul 3, 2013
3,017
1,945
136
That's not true. RV code density is fine, and nobody is forced to use instruction fusion. It is only a problem when trying to convert an Arm hardware design to RV: existing hardware implementations are useless without supporting them with extensions or trying to fuse instructions to look Arm-like. But there are already many well-working RV cores showing that the RV philosophy is fine.
Yes sure having to use three instructions to make a pointer access is the way to go for density. Then why did R-V add an extension to reduce the issue? Then why did at least one company add a proprietary extension?
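To make the "three instructions for a pointer access" point concrete, here is a minimal C sketch of an indexed array load. The function names are made up for illustration, and the instruction sequences in the comments are what gcc and clang typically emit at -O2 for this pattern. They are an assumption, not output taken from this thread, so check them on godbolt for your compiler version.

```c
#include <stddef.h>
#include <stdint.h>

/* Indexed load as written in a high-level language.
 * Typical codegen (assumption, varies by compiler and flags):
 *   x86-64:        mov  eax, [rdi + rsi*4]
 *   AArch64:       ldr  w0, [x0, x1, lsl #2]
 *   RV64GC:        slli a1, a1, 2 ; add a0, a0, a1 ; lw a0, 0(a0)
 *   RV64GC + Zba:  sh2add a0, a1, a0 ; lw a0, 0(a0)
 */
int32_t load_indexed(const int32_t *a, size_t i)
{
    return a[i];
}

/* The same access spelled out as the three base-ISA steps. */
int32_t load_indexed_base_isa(const int32_t *a, size_t i)
{
    uintptr_t offset = (uintptr_t)i << 2;    /* slli: scale index by sizeof(int32_t) */
    uintptr_t addr = (uintptr_t)a + offset;  /* add: base + scaled index */
    return *(const int32_t *)addr;           /* lw: load from the computed address */
}
```

Both functions return the same value; the second only makes visible the shift and add that a scaled-index addressing mode folds into the load.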

If code density is good then why did RISC-V add *seven* extensions related to code size? Zca, Zcb, Zcd, Zce, Zcf, Zcmp, Zcmt

RISC-V philosophy is obvious: it was designed by a student who had no experience in ISA design and wanted to be able to implement things alone. He didn't even add integer mul or divide, which had to come later as an extension. That was the only initial advantage of R-V: it's so primitive, you don't need many people to make a ridiculously low perf CPU. A good student project turned into a poor industry ISA.

It's now growing out of control with all the extensions needed to make it a good performing ISA.

Turn it around all you want, RISC-V is the worst ISA designed in the last 20 years. If it was not for an aging university prof who was left with his memories from an age where he made sense and made remarkable contributions, we would never have heard of that pitiful thing.

BTW I also have a "well working" Turing machine. It's just as useless as most RISC-V designs.
 

naukkis

Senior member
Jun 5, 2002
853
726
136
He didn't even add integer mul or divide, which had to come later as an extension. That was the only initial advantage of R-V: it's so primitive, you don't need many people to make a ridiculously low perf CPU. A good student project turned into a poor industry ISA.
That's just a well-designed, extendable and upward-compatible ISA. Not all CPUs need mul or div. To make an extremely small CPU, other instruction sets need mul and div microcoded, which leads to a microcode engine bigger than the whole CPU execution hardware. RV is the only standard, widely used ISA that scales well to those designs, and it has the dominant market share in those designs today.
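For context on what "no mul" means in practice: on a core built without the M extension, the toolchain falls back to a library routine for multiplication (libgcc's `__mulsi3` on RV32I, for example). The sketch below is a generic shift-and-add multiply under that assumption; `soft_mul32` is a hypothetical name and this is not libgcc's actual implementation.

```c
#include <stdint.h>

/* Shift-and-add multiply: one conditional add per multiplier bit.
 * Returns the low 32 bits of a * b, like the RISC-V "mul" instruction. */
uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)
            result += a;  /* add the multiplicand at its current weight */
        a <<= 1;          /* the next multiplier bit weighs twice as much */
        b >>= 1;
    }
    return result;
}
```

The loop runs at most 32 times and uses only adds, shifts and branches, which is all the base integer ISA offers.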
 

Nothingness

Diamond Member
Jul 3, 2013
3,017
1,945
136
That's just a well-designed, extendable and upward-compatible ISA. Not all CPUs need mul or div. To make an extremely small CPU, other instruction sets need mul and div microcoded, which leads to a microcode engine bigger than the whole CPU execution hardware. RV is the only standard, widely used ISA that scales well to those designs, and it has the dominant market share in those designs today.
As usual you only answer the part of the discussion for which you have an answer and which doesn't even address my points.

BTW no CPU uses microcoded MUL or DIV. Let me guess, you have no experience in CPU design beyond toy projects?
 

Nothingness

Diamond Member
Jul 3, 2013
3,017
1,945
136
Try it with O2. Identical. Use clang O2 and only x86 vectorizes it. Faster code has more instructions.
Yes I know. I was only addressing the code size aspect with the standard flag used for that to show the issue with RISC-V limited addressing modes.

Now try -O3 and it gets vectorized on Arm too and code size explodes, but performance will be much higher: https://godbolt.org/z/9za8oMzcc
x86-64 and Arm codes are similar doing 4 adds with 7 instructions in the critical loop while R-V is stuck doing 1 add with 7 instructions. There might be some flag to enable vectorizing on R-V, but then what CPU would support it?
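The source behind the godbolt link is not reproduced in this thread, so the loop below is a hypothetical stand-in for the kind of code being compared. It is a sketch of what "1 add per iteration" versus "4 adds per iteration" means, with the second function written out by hand to mimic what a 128-bit vectorizer effectively does for 32-bit elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar form: one add per loop iteration. */
void add_arrays(int32_t *restrict dst, const int32_t *restrict a,
                const int32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Hand-written 4-wide form: four adds per main-loop iteration,
 * followed by a scalar tail for the leftover elements. */
void add_arrays_x4(int32_t *restrict dst, const int32_t *restrict a,
                   const int32_t *restrict b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = a[i + 0] + b[i + 0];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)  /* tail: fewer than four elements left */
        dst[i] = a[i] + b[i];
}
```

The loop overhead (index update, compare, branch) is paid once per iteration in both versions, which is why the instruction count per add matters more than the raw instruction count.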

BTW it's interesting gcc only starts vectorizing at -O3 on Arm while it does at -O2 on x86-64. Didn't check what clang does.
 

camel-cdr

Junior Member
Feb 23, 2024
20
65
51
@camel-cdr Do you want to share your thoughts on here?
I've responded to the original Twitter thread, so I'll just copy-paste my comment from r/riscv that paraphrases the answers:


Regarding RVC decode complexity

I think the decode is missing part of the picture.

1725226421481.png

For a fixed size isa to be competitive it needs to have more complex instructions that need to be cracked into uops, at which point you already have a scaling similar to RVC style variable length decoding.

I'd also argue that RISC-V has fewer uops to crack in the float and int pipelines. Yes, LMUL requires cracking, but that's a lot more throughput oriented and can be done more easily later in the pipeline, because it's more decoupled.

If you look at the Apple Silicon CPU Optimization guide, you can see that it's even worse than in the edited picture because instructions are cracked into up to 3 uops. This includes common instructions like pre-/post-index load/stores and instructions that cross register files.

The Cortex X4 software optimization guide wasn't released yet, but let's look at the one from the Cortex X3: Again, pre-/post-increment loads/stores: 2/3 uops.

We already have open-source implementations that can reach 4 IPC, and commercial IP that can go above that and has >8-wide decode.


Regarding RVV

IPC is a useless metric for RVV, since LMUL groups instructions. If you consider LMUL*IPC, then it's incredibly easy to reach >4 IPC, because of the implicit unrolling.
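As a rough illustration of why plain IPC undercounts RVV work, the sketch below computes VLMAX from the RVV relation VLMAX = LMUL * VLEN / SEW and an LMUL-weighted IPC. The function names are hypothetical, and only integer LMUL values are handled.

```c
#include <stdint.h>

/* Elements covered by one vector instruction at full length:
 * VLMAX = LMUL * VLEN / SEW (integer LMUL only in this sketch). */
uint32_t vlmax(uint32_t vlen_bits, uint32_t sew_bits, uint32_t lmul)
{
    return lmul * vlen_bits / sew_bits;
}

/* LMUL-weighted IPC: each retired instruction counts as LMUL
 * single-register operations. */
uint32_t lmul_weighted_ipc(uint32_t ipc, uint32_t lmul)
{
    return ipc * lmul;
}
```

With VLEN=128 and SEW=32, one LMUL=4 instruction covers 16 elements, so a core retiring a single such instruction per cycle already does the register-level work of four LMUL=1 instructions.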

Regarding 6x src/dst, the count doesn't really matter, the bits do. Implementations have a separate register file for vtype/vl, and do rename on that. Yes, OoO implementations need to rename, predict and speculate vtype/vl; that was expected from the beginning. ta/ma gets rid of the predication, mu only applies if you use a masked instruction, and tu can be treated as ta if you know/predict vl=VLMAX.
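The tail/mask policies above can be modelled in scalar C. This is a sketch, not spec text: `model_vadd` and the enum are hypothetical names, and "agnostic" is modelled as writing all ones, which is one of the two behaviours the RVV spec permits (the other is leaving the old value in place).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum policy { UNDISTURBED, AGNOSTIC };

/* Scalar model of a (possibly masked) vector add over one register
 * holding vlmax elements. mask == NULL means an unmasked instruction. */
void model_vadd(int32_t *vd, const int32_t *vs1, const int32_t *vs2,
                const bool *mask, size_t vl, size_t vlmax,
                enum policy tail, enum policy mask_policy)
{
    for (size_t i = 0; i < vlmax; i++) {
        if (i >= vl) {
            /* tail element: tu keeps old vd[i], so vd is also an input */
            if (tail == AGNOSTIC)
                vd[i] = -1;  /* all ones */
        } else if (mask != NULL && !mask[i]) {
            /* masked-off element: mu keeps old vd[i] */
            if (mask_policy == AGNOSTIC)
                vd[i] = -1;  /* all ones */
        } else {
            /* active element: wrapping 32-bit add */
            vd[i] = (int32_t)((uint32_t)vs1[i] + (uint32_t)vs2[i]);
        }
    }
}
```

In this model the old destination is only read when an undisturbed policy actually hits an inactive element. With ta/ma nothing depends on the old vd, with an unmasked instruction mu never triggers, and with vl == vlmax there are no tail elements, so tu behaves like ta.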

From what I've seen, most high perf ooo implementations split LMUL>1 into LMUL=1 uops, but implementations differ in when they do the splitting. We already have out-of-order RVV implementations, even multiple open source ones.

1725226478059.png
1725226510333.png

* https://github.com/OpenXiangShan/XiangShan
* https://github.com/riscv-stc/riscv-boom/tree/matrix

The XiangShan one is still missing vtype/vl prediction, however, that is currently WIP.
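The splitting of LMUL>1 instructions into LMUL=1 uops mentioned above can be sketched as follows. A register group at LMUL=n occupies n consecutive vector registers starting at a register number that is a multiple of n, so cracking amounts to stepping the register numbers. The struct and function names are hypothetical, and this does not describe how XiangShan or BOOM actually implement it.

```c
#include <stddef.h>
#include <stdint.h>

struct vec_uop {
    uint8_t vd, vs1, vs2;  /* single-register (LMUL=1) operands */
};

/* Crack one vector-vector instruction with integer LMUL into LMUL=1 uops.
 * Returns the number of uops written to out (capacity 8),
 * or 0 if LMUL or the register alignment is invalid. */
size_t crack_lmul(uint8_t vd, uint8_t vs1, uint8_t vs2, unsigned lmul,
                  struct vec_uop out[8])
{
    if (lmul != 1 && lmul != 2 && lmul != 4 && lmul != 8)
        return 0;
    if (vd >= 32 || vs1 >= 32 || vs2 >= 32)
        return 0;
    /* register groups must start at a multiple of LMUL */
    if (vd % lmul != 0 || vs1 % lmul != 0 || vs2 % lmul != 0)
        return 0;
    for (unsigned k = 0; k < lmul; k++) {
        out[k].vd = (uint8_t)(vd + k);
        out[k].vs1 = (uint8_t)(vs1 + k);
        out[k].vs2 = (uint8_t)(vs2 + k);
    }
    return lmul;
}
```

For example, an LMUL=4 add with vd=v8, vs2=v16, vs1=v24 becomes four uops operating on v8..v11, v16..v19 and v24..v27.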
 


camel-cdr

Junior Member
Feb 23, 2024
20
65
51
Try it with O2. Identical. Use clang O2 and only x86 vectorizes it. Faster code has more instructions.
clang O2 vectorizes it on both arm and RISC-V: https://godbolt.org/z/TGvWKWch3
x86-64 and Arm codes are similar doing 4 adds with 7 instructions in the critical loop while R-V is stuck doing 1 add with 7 instructions. There might be some flag to enable vectorizing on R-V, but then what CPU would support it?

Both do 4 adds per loop, but Arm takes 10 instructions, while RISC-V takes 8. If we are fair and expand the load pair and the LMUL=2 instructions, then we get 12 uops for Arm and 20 uops for RISC-V.
clang currently defaults to rv64gc, but it's expected to change in the future to the application profiles.
 

naukkis

Senior member
Jun 5, 2002
853
726
136
BTW no CPU uses microcoded MUL or DIV. Let me guess, you have no experience in CPU design beyond toy projects?
You are clueless. All really small CPU designs have mul and div microcoded, and non-microcoded hardware dividers such as radix dividers are only found on fairly new big CPU designs. For example, AMD K10's div was microcoded until Llano introduced a radix divider, which was broken, so it too had to revert to microcoded div.
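For readers unfamiliar with what an iterative divide looks like, here is a minimal radix-2 restoring division in C that produces one quotient bit per step. It is a generic textbook sketch with a hypothetical name (`soft_udiv32`), not a description of any AMD design, and whether a given core runs such a loop from microcode or in a dedicated state machine is exactly the point under dispute here.

```c
#include <stdint.h>

/* Radix-2 restoring division: 32 steps, one quotient bit per step.
 * The caller must ensure d != 0. Returns the quotient, stores the
 * remainder in *rem. */
uint32_t soft_udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;  /* 64-bit so the shift below cannot overflow */
    for (int bit = 31; bit >= 0; bit--) {
        r = (r << 1) | ((n >> bit) & 1u);  /* bring down the next dividend bit */
        if (r >= d) {                      /* trial subtraction succeeds */
            r -= d;
            q |= 1u << bit;
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```

Higher-radix dividers retire more than one quotient bit per step (two for radix-4), which is where the latency advantage of dedicated hardware comes from.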
 
Jul 27, 2020
19,482
13,357
146
@naukkis You up for that senior cpu architect job? https://www.aheadcomputing.com/careers

Senior CPU Architect​

Responsibilities
  • You will innovate and define features to target power, performance, area, and timing goals
  • You will develop and refine microarchitecture, and write high-level architecture specification
  • You will write RTL of complex IP subsystems
  • You will explore high performance strategies and validate that the RTL design meets targeted performance

Qualifications and Skills
  • Thorough knowledge of microprocessor architecture and microarchitecture including high performance and low power trade-offs
  • Thorough knowledge of performance model development
  • Verilog/System Verilog development experience or desire to learn and quickly ramp on RTL design
  • Experience using an interpretive language such as Perl or Python
  • Expert problem solving skills
 

naukkis

Senior member
Jun 5, 2002
853
726
136
There is a market for small CPU designs too. A good ISA will cover those markets as well, where implementing mul and div in hardware or ucode makes the core too big. RV covers that nicely with feature levels; Arm does it more crudely by making some instructions like div optional. Even Intel considered that market some years ago and planned ucode-less x86-subset CPUs, but those would have been quite uncompetitive against the competition.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,469
1,943
136
For a fixed size isa to be competitive it needs to have more complex instructions that need to be cracked into uops, at which point you already have a scaling similar to RVC style variable length decoding.

...

If you look at the Apple Silicon CPU Optimization guide, you can see that it's even worse than in the edited picture because instructions are cracked into up to 3 uops. This includes common instructions like pre-/post-index load/stores and instructions that cross register files.

It's not that simple. There are ways to do this so that your frontend does not blow up, for example the way AMD emits only one op from the frontend, but it gets issued into multiple different types of schedulers. The classic way is how K7 tracked a load-alu-store op as a single instruction, which would get issued to different units (agu, then alu, then store) serially. I think on Zen, a read-alu op doesn't get emitted from the frontend as two ops, but as one op that gets duplicated on insertion into the relevant queues. I don't know much about Apple's cores, but I think they use a similar technique. At no point is that post-index load expanded into more than one entry in the same queue; it just gets issued simultaneously into multiple queues, while most simple ops only get issued into one.
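A toy model of the dispatch scheme described above, purely as a sketch: each op takes one frontend slot and then gets one entry in every queue it needs, never more than one per queue. The queue names, struct layout and `dispatch` function are all hypothetical and not taken from any vendor documentation.

```c
#include <stddef.h>

enum { Q_AGU = 0, Q_ALU = 1, Q_STORE = 2, NUM_QUEUES = 3 };

struct macro_op {
    const char *name;
    unsigned queue_mask;  /* bit q set: the op needs an entry in queue q */
};

struct dispatch_stats {
    unsigned frontend_slots;             /* ops emitted by the frontend */
    unsigned queue_entries[NUM_QUEUES];  /* entries inserted per queue */
};

/* Each op uses exactly one frontend slot and is duplicated into
 * every queue named in its mask, at most once per queue. */
struct dispatch_stats dispatch(const struct macro_op *ops, size_t n)
{
    struct dispatch_stats s = {0};
    for (size_t i = 0; i < n; i++) {
        s.frontend_slots++;
        for (int q = 0; q < NUM_QUEUES; q++)
            if (ops[i].queue_mask & (1u << q))
                s.queue_entries[q]++;
    }
    return s;
}
```

A plain add and a load-alu-store op together occupy two frontend slots, with the second op landing once in each of the AGU, ALU and store queues, so the frontend width is not multiplied by the op's internal complexity.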