RISC V Latest Developments Discussion [No Politics]


DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Some background on my experience with RISC V...
Five years ago, we were developing a CI/CD pipeline for an arm64 SoC in the cloud, and we added tests that executed the binaries there as well.
We actually used real HW instances based on an ARM server chip of that era. Unfortunately the vendor quickly dumped us and exited the market, leaving us with a fair amount of frustration.
We shifted the work to Qemu, which turned out to be as good as the actual chips themselves, although the emulation was buggy and slow. In the end we settled on qemu-user-static docker images, which worked quite well for us. We were running arm64 ubuntu cloud images of the time before moving on to docker multi arch qemu images.

Lately, we have been approached by many vendors with upcoming RISC-V chips, and out of curiosity I revisited the topic above.
To my pleasant surprise, running RISC-V Qemu is smooth as butter. Emulation is fast, and images from Debian, Ubuntu, Fedora are available out of the box.
I was running ubuntu cloud images problem free. Granted it was headless but I guess with the likes of Imagination Tech offering up their IP for integration, it is only a matter of time.

What is even more interesting is that Yocto/Open Embedded already have a meta layer for RISC-V and apparently T Head already got the kernel packages and manifest for Android 10 working with RISC-V.
Very very impressive for a CPU in such a short span of time. What's more, I see active LLVM, GCC and Kernel development happening.

I saw this slide at one of the latest conferences, and I can't help but think that it looks like they are eating somebody's lunch, starting from MCUs and moving to Application Processors.
1652093521458.png

And based on many developments around the world, this trend seems to be accelerating greatly.
Many high profile national and multi national (e.g. EU's EPI ) projects with RISC V are popping up left and right.
Intel is now a premium member of the consortium, with the likes of Google, Alibaba, Huawei etc..
NVDA and soon AMD seem to be doing RISC-V in their GPUs. Xilinx, Infineon, Siemens, Microchip, ST, AD, Renesas etc. either have products in the pipeline or have already launched them.
It will be a matter of time before all these companies start replacing their proprietary Arch with something from RISC V. Tools support, compiler, debugger, OS etc. are taken care of by the community.
Interesting as well is that there are lots of performant implementations of RISC V on GitHub: XuanTie C910 from T Head/Alibaba, SweRV from WD, and many more.
The embedded industry has already replaced a ton of traditional MCUs with RISC V ones. AI tailored CPUs from Jim Keller's Tenstorrent also seem to be in the spotlight.

Most importantly, a bunch of specs were ratified at the end of last year, mainly accelerated by developments around the world. Interesting times.
 

gdansk

Platinum Member
Feb 8, 2011
2,836
4,218
136
So all the UV light sources that ASML litho machines use are manufactured in the US?
No, but about half their machines by value are made in Wilton. And all their EUV light sources in San Diego.
ASML is heavily invested in the US and like any other company heavily invested in the US, it is required to follow US export restrictions.
But as soon as they are 'free' of the US physically and intellectually they will be free of their export restrictions.
 

soresu

Diamond Member
Dec 19, 2014
3,190
2,463
136
I do wonder if maybe the unexpected delays in debut of the new ground up ARM core µArch are due to a change in design philosophy to circumvent such restrictions.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,704
1,230
136
Been going through the later RISC-V events:
2xp550and1xp670.jpeg

Two P550 boards in the 1st Half and One P670 board in the second half.

As well as some of the more interesting Ventana V2 stuff:
executionengine.jpeg
macroops.jpeg
vectorengine.jpeg
 

DrMrLordX

Lifer
Apr 27, 2000
21,998
11,555
136
DUV is older gear
True, but only a few companies can make it, and there are plenty of things you can do with the rough equivalent of TSMC N7. SMIC can't produce its own DUV gear, but there are plenty of players in China that are either already doing it or would love the chance to shave a little cost off their bottom line while serving SMIC domestically. The incentive to copy foreign tech exists with or without trade embargos.

Of course ASML wouldn't want to sell cutting-edge EUV gear knowing that it could possibly be copied in less than a decade. Possibly. My point is, you need to be cautious selling ANYTHING in the PRC.
 

soresu

Diamond Member
Dec 19, 2014
3,190
2,463
136
Of course ASML wouldn't want to sell cutting-edge EUV gear knowing that it could possibly be copied in less than a decade
That is to be expected anyway given the whole "keep em 2 gens off at least" rule limiting what equipment they can import from western suppliers has been going for much longer than the previous US administration.

Given the interesting things going on with research covering quantum dots and metalenses in the UV light range though I don't think it will take quite so long for them to crack it now as it did for ASML, whose solution is pretty complex from what I've read into it.
 

Hitman928

Diamond Member
Apr 15, 2012
6,025
10,353
136

NostaSeronx

Diamond Member
Sep 18, 2011
3,704
1,230
136
Interesting. Pretty cheap price but also pretty terrible performance. Obviously not meant to compete on performance, but when even an Atom processor from over a decade ago smokes it, it's hard to get excited at the possibilities.
C910's B-ext is loaded in custom opcode. As well as no V-ext support w/ rollback support from geekbench.

GB5 - Single core: C910 vs A72
gb5.png
GB6 - Single core: C910 vs A72
gb6.png

RVV LMUL=1 versus Neon with same units is almost 2x increase in performance in Computer Vision(CV) workloads. While RVV LMUL=2 is cheap on the C910, which is not identical. A72 = 2x64-bit NEON and C910 = 2x128-bit RVV.
 

Hitman928

Diamond Member
Apr 15, 2012
6,025
10,353
136
That’s nice for those who want it, but I personally don’t care about the AI stuff.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,704
1,230
136
That’s nice for those who want it, but I personally don’t care about the AI stuff.
The current designs out are VEC > OoO Int/Fp. The best case workloads are stuff that do not require strong cores.

The bare minimum to get beyond VEC-happy code:
ax65.jpeg
4 or more Integer ALUs.

CN has yet to officially deploy or showcase Wide+Vector in corpo. It is all Narrow+Vector, which requires converting RVGC flows to RVV flows to get maximum performance from such cores. Hence why C908 (2022) and C920 (the 2023 upgrade of the 2019 C910) are AI-leaning.

Alibaba is a partner for Kunminghu though:
alibaba.jpeg
4+2+1 ALU/MDU/MISC + 4+2 FPU/FMISC + 2+1+1 LD/ST/LD-ST, unspecified VPU-width(1x128, 2x128, 3x128, etc) VLEN=128b (like XT910/C910/C920)
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,704
1,230
136
That's the price to pay when all CPU support vastly different extensions.

That being said R-V Geekbench still is in beta so there's room for improvement.
Overall that isn't an issue going forward, as the XT910/C910 was a wild-west pre-[riscv-profiles] core, while the C920 is a civilized post-[riscv-profiles] core. Most of the cores with Linux/Android/Windows targets coming out are at least RVA22 Ratified Compatible.

Software will be easier to port to a set standard than no standard.

~~~~ Another note not in reply to quote ~~~~
Banana Pi (BPI-F3); SpacemiT X60 cores in the SpacemiT K1 SoC
1.3x Performance of A55
0.8x Power of A55
RVA22 + RVV w/ VLEN=256-bit
TDP: 3~5W

Low-power RISC-V cores have caught up to low-power ARMv9-A cores.
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,034
136
"0.8x power" and "20% higher efficiency at 30% higher performance" - which is what they actually claim - are not even remotely the same thing, and I suspect you know that, Mr. Tunnelborer And Crane.

Even beyond that, the question of "at what?" is relevant and unanswered. If it's about SIMD throughput, then that doesn't mean much.
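To make the gap between the two readings concrete, here is a minimal Python sketch. It assumes "efficiency" simply means performance divided by power, with everything expressed relative to the A55 baseline. The function names are made up for illustration and nothing here comes from SpacemiT's own material.

```python
def efficiency(perf_ratio, power_ratio):
    """Performance per watt relative to the baseline (baseline = 1.0)."""
    return perf_ratio / power_ratio


def implied_power(perf_ratio, efficiency_ratio):
    """Power relative to the baseline implied by a perf and efficiency claim."""
    return perf_ratio / efficiency_ratio


# Reading A: 1.3x performance at 0.8x power, taken together.
reading_a = efficiency(1.3, 0.8)

# Reading B: 1.3x performance with 1.2x efficiency, taken together.
reading_b = implied_power(1.3, 1.2)

print(f"Reading A implies {reading_a:.3f}x efficiency")  # about 1.625x
print(f"Reading B implies {reading_b:.3f}x power")       # about 1.083x
```

Under that assumption, reading A would mean roughly 62% better efficiency, while reading B would mean slightly more power than the A55, which is why the two statements are not interchangeable.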
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,704
1,230
136
"0.8x power" and "20% higher efficiency at 30% higher performance" - which is what they actually claim - are not even remotely the same thing, and I suspect you know that.
Under '卓越的CPU性能'
"单核CPU算力领先ARM A55 30%以上"
"leads ARM A55 by more than 30%"

Under '领先的算力能效'
"RISC-V架构的精简和卓越的微架构设计,算力能效比ARM A55高20%以上"
"more than 20% higher than that of ARM A55"

However I am using this for power: "K1 chip can reduce 20% of ineffective energy waste compared with similar chips."
There is also a fixed performance metric locking it at 30%. So: 1.3x/0.8x is valid since the numbers I am quoting are different.

They are separate and not together. It is an either/or statement because of that separation: 30% higher performance OR 20% lower power. The two also apply to different scenarios, which leads to the difference: the 30% higher performance is associated with a single core (vs single-core A55), while the 20% lower power is associated with the whole SoC (vs octo-core A55).

The fixed target reflects the K1 chip implementation, while the unfixed target reflects the X60 cores regardless of implementation: More than 1.3x higher performance :: More than 1.2x higher power efficiency.

Basically, this is being done:
arma510vsa55.jpeg
X60-K1 1.3x performance is at same power relative to single-core A55.
X60-K1 0.8x power reduction is at same performance to octo-core A55.
Of which, these numbers aren't related to the total performance increase or total power efficiency increase of X60, just the K1 chip.
Even beyond that, the question of "at what?" is relevant and unanswered. If it's about SIMD throughput, then that doesn't mean much.
Same tasks given A55 vs X60.
A55: Dual-issue InO 8-stage
X60: Dual-issue InO 9-stage

For the larger market, it means that most of the budget smartphones using the 2xA76+6xA55 config can be replaced by a 2*X100+6*X60 system, while still getting the performance/power eff. boost of ARMv9-A without the changed license deal (per-chip cost -> per-device cost).

The latest device though is this one: "1*A76 @ 2.7GHz+3*A76 @ 2.3GHz+4*A55 @ 2.1GHz", which is meant to come out by 2Q'24. We will have to wait for the "K1 Max" SoC, which has X100 cores, to see if the X100 goals are also beyond ARMv8-A cores.

As well as capability to replace the A53 re-releases: https://www.gsmarena.com/xiaomi_redmi_a3-12822.php

K1 -> [SC9863A/8581E/A8581] // Allwinner A523 if K1 wasn't 28-nm.
K1 50k DMIPS <-- https://www.unisoc.com/en_us/home/TQCDZ-A8581-2 :: 30k DMIPS.

K1 Max -> [T820/P7885/A7870] // X100 is 2.3 GHz on 12nm making it succeed the [T740/T710/7863/A7862] and the Allwinner A736. Thus on par with the 6nm products. As well as that one 8nm product.
etc.

Basically a perfect fit for the 200K one:
roadmap.jpeg

Also, refound the Vector config of Xiangshan v3/Kunminghu:
vectorunit.jpeg
VLEN=128b and 2x128 units re-uses the 4x FPU.

Xiangshan is still the highest performing OoO-sided processor so far announced. It is also not frozen till it pops up here: https://github.com/OpenXiangShan/XiangShan/tags
Kunminghu V1 (3.0) => 4x ALUs + 2x 128b VALUs + 1x 128b VMISC // December 2024 *based on prior releases*
Kunminghu V2 (3.x) => 6x ALU (MDU gets standard ALUs) + 4x 128b VALU + 2x 128b VMISC // Anytime after December 2025 *following Nanhu V2 -> V3*
 

camel-cdr

Junior Member
Feb 23, 2024
20
65
51
So the XiangShan HPCA'24 slides are now online, there are only a few changes from the MICRO'23 ones:

V3 has been simulated with SPECint2006, and got 44.98@3GHz in December, with RV64GCB. (The RVV backend hasn't been merged yet, and from my tests two weeks ago isn't stable yet, and hung after a few minutes)
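Since SPEC scores from different vendors are usually quoted at different clocks, a per-GHz figure is a common way to compare them. A minimal sketch of that normalization, using only the score and frequency quoted above (the function name is just for illustration):

```python
def score_per_ghz(score, freq_ghz):
    """Normalize a benchmark score by the clock it was measured at."""
    return score / freq_ghz


xiangshan_v3 = score_per_ghz(44.98, 3.0)
print(f"XiangShan V3: {xiangshan_v3:.2f} SPECint2006/GHz")  # about 14.99
```

This is a simulated result, so the per-GHz number says nothing about what clock real silicon would reach.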

Plan for V2 based "Mid Core" and V3 based "Big Core". (I don't think this was meant in the BIG/LITTLE sense)
 

camel-cdr

Junior Member
Feb 23, 2024
20
65
51
So, I just tried running the new OpenXiangShan backend again, and it seems to work except for vrgather.vv, so I've got some benchmarks against my 1600X desktop for y'all:


The benchmark:
  • The measurements are from the simdutf vectorized utf8 to utf16 conversion routines, using my PR for the RVV implementation.
  • Both vectorized versions assume valid input and only do bounds checks, because utf8 validation requires vrgather.vv in RVV and that currently doesn't work in XiangShan.
  • The XiangShan results are from the DefaultConfig.
  • The results were averaged on x86, but are just one sample on XiangShan, because it was running under verilog simulation, which is incredibly slow.
  • The capitalized inputs are from the lipsum dataset, which contains lorem ipsum style text and is thus quite regular. The others are the source code of wikipedia entries in the respective languages and are closer to real world data.
  • The numbers are in input bytes/cycle, so the bigger, the better. You can multiply the numbers by clock frequency to get approximately GB/s.
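The conversion in the last bullet works because bytes/cycle times cycles/second gives bytes/second, and with the clock in GHz the factors of 1e9 cancel. A minimal sketch, where the 1.2 bytes/cycle and 3.0 GHz values are purely illustrative assumptions and not measurements:

```python
def throughput_gb_per_s(bytes_per_cycle, clock_ghz):
    """bytes/cycle * (clock_ghz * 1e9 cycles/s) / (1e9 bytes/GB) = GB/s."""
    return bytes_per_cycle * clock_ghz


# Hypothetical example: 1.2 input bytes per cycle on a core assumed to run at 3.0 GHz.
print(f"{throughput_gb_per_s(1.2, 3.0):.2f} GB/s")  # about 3.60 GB/s
```

This is only approximate, since it assumes the bytes/cycle figure stays the same at the target clock, which memory latency usually prevents.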

XiangShan   scalar     RVV        speedup
Latin       0.919203   1.218785   1.33x
Japanese    0.239199   0.532492   2.23x
Hebrew      0.148244   0.691389   4.66x
Korean      0.187919   0.504613   2.69x
Emoji       0.302343   0.324324   1.07x
german      0.596167   0.940519   1.58x
japanese    0.292013   0.624463   2.14x
arabic      0.243619   0.801790   3.29x

1600X       scalar     AVX2       speedup
Latin       3.444410   5.196881   1.51x
Japanese    0.274903   1.132911   4.12x
Hebrew      0.186775   0.722549   3.87x
Korean      0.219586   0.700254   3.19x
Emoji       0.294633   0.459388   1.56x
german      0.686341   1.766784   2.57x
japanese    0.465766   0.879507   1.89x
arabic      0.394321   0.914913   2.32x

Note that this is very specific hand vectorized code for both processors.
While the 1600X has AVX2 with 256-bit per register, and XiangShan only 128, keep in mind that RVV has some more expressive/feature rich instructions.
Particularly vcompress is interesting for the implementation and the AVX512 version does make use of their byte compress instruction.
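For anyone who wants to play with the numbers, here is a small Python sketch that recomputes the speedup column and adds a per-cycle comparison between the two vectorized versions. The data is copied from the tables above; the dictionary layout and function names are just one way to arrange it, and the cross-machine ratio is my own derived figure, not something measured separately.

```python
# (scalar, vector) throughput in input bytes/cycle, copied from the tables above.
xiangshan = {
    "Latin": (0.919203, 1.218785), "Japanese": (0.239199, 0.532492),
    "Hebrew": (0.148244, 0.691389), "Korean": (0.187919, 0.504613),
    "Emoji": (0.302343, 0.324324), "german": (0.596167, 0.940519),
    "japanese": (0.292013, 0.624463), "arabic": (0.243619, 0.801790),
}
ryzen_1600x = {
    "Latin": (3.444410, 5.196881), "Japanese": (0.274903, 1.132911),
    "Hebrew": (0.186775, 0.722549), "Korean": (0.219586, 0.700254),
    "Emoji": (0.294633, 0.459388), "german": (0.686341, 1.766784),
    "japanese": (0.465766, 0.879507), "arabic": (0.394321, 0.914913),
}


def speedup(scalar, vector):
    """How many times faster the vectorized routine is than the scalar one."""
    return vector / scalar


def per_cycle_ratio(name):
    """XiangShan RVV throughput relative to 1600X AVX2, per clock cycle."""
    return xiangshan[name][1] / ryzen_1600x[name][1]


for name in xiangshan:
    print(f"{name:9s} RVV {speedup(*xiangshan[name]):.2f}x  "
          f"AVX2 {speedup(*ryzen_1600x[name]):.2f}x  "
          f"RVV/AVX2 per cycle {per_cycle_ratio(name):.2f}")
```

Keep in mind the per-cycle ratio ignores clock speed entirely, so it is not a statement about wall-clock throughput.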
 

Nothingness

Diamond Member
Jul 3, 2013
3,029
1,971
136
So the XiangShan HPCA'24 slides are now online, there are only a few changes from the MICRO'23 ones:
Very interesting presentation, thanks for sharing!

Unsurprisingly the various flows are similar to what is used in the industry with a welcome open source twist.

V3 has been simulated with SPECint2006, and got 44.98@3GHz in December, with RV64GCB. (The RVV backend hasn't been merged yet, and from my tests two weeks ago isn't stable yet, and hung after a few minutes)
When do you intend on switching to SPECint2017?

Quite curious to see how V3 performs in silicon (both perf and power).

EDIT: Hmm, I noticed there's no mention of formal methods for verification (or I missed it).
 

camel-cdr

Junior Member
Feb 23, 2024
20
65
51
When do you intend on switching to SPECint2017?
I'm not affiliated with the project, but I'd guess they use SPECint2006 because most RISC-V vendors only published that.
EDIT: Hmm, I noticed there's no mention of formal methods for verification (or I missed it).
I'm not sure how they do verification, but maybe they do that more privately. There are some verification docs here.

BTW, if you haven't seen the older slides, then this might also interest you, it goes into more detail on the microarchitecture.
 

soresu

Diamond Member
Dec 19, 2014
3,190
2,463
136
Quite curious to see how V3 performs in silicon (both perf and power).
As with all things, it will depend on where they are fabbed and on what node.

The announcement PR for A76 talked about 7nm and yet the RPi5 SoC is fabbed on 12nm 😅

Likewise with RPi4 and A72 on 28nm vs most smartphone SoC's using it on 16/14/12/11nm.
 

camel-cdr

Junior Member
Feb 23, 2024
20
65
51
The first version of the SiFive p670 schedule model has a PR now, this gives some insights into the expected performance.

I'm not sure I fully understand how the llvm scheduler definition format works, but here is my interpretation:

IssueWidth: 4 micro-ops
MicroOpBufferSize: 160
LoadLatency: 4 cycles (from cache)
MispredictPenalty: 9 cycles
2 Load/Store ports: 2 loads, or 2 stores, or 1 load and 1 store

Int execution units:
IEXQ0: general int, bitmanip, csr, vsetvl, cmov
IEXQ1: general int, bitmanip, imul, idiv, i2f
IEXQ2: general int, bitmanip, branch, cmov
IEXQ3: general int, bitmanip, branch, cmov

Float execution units:
FEXQ0: general float
FEXQ1: general float, fdiv

Vector execution units:
VEXQ0: general vector, mask, viota/vidx, slide by 1 or immediate, LMUL<=1 slide by X
VEXQ1: general vector, idiv, fdiv, fsqrt, reductions, LMUL>1 slide by x, gather and compress


It looks like all vector operations have an LMUL cycle latency per port, except for mask operations, that have a 1 cycle latency per port, and vfdiv/vfsqrt.
This is probably a simplification by the scheduling model, 8 cycle LMUL=8 vrgather.vv would be crazy.

The latency differs, you can browse through RISCVSchedSiFiveP600.td to see the numbers.
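Here is a rough Python sketch of how the port list above can be turned into simple throughput bounds. It only encodes my interpretation from this post, not the actual contents of the .td file: the op class names are shorthand I made up, and the vector part assumes each op occupies one of the two vector ports for LMUL cycles and cannot be split across ports.

```python
from math import ceil

ISSUE_WIDTH = 4  # micro-ops per cycle, per the interpretation above

# Integer ports and the op classes each accepts (class names are shorthand).
INT_PORTS = {
    "IEXQ0": {"int", "bitmanip", "csr", "vsetvl", "cmov"},
    "IEXQ1": {"int", "bitmanip", "imul", "idiv", "i2f"},
    "IEXQ2": {"int", "bitmanip", "branch", "cmov"},
    "IEXQ3": {"int", "bitmanip", "branch", "cmov"},
}
VEC_PORTS = 2  # VEXQ0 and VEXQ1 both accept general vector ops


def ports_for(op_class):
    """Names of the integer ports that can execute this op class."""
    return sorted(name for name, ops in INT_PORTS.items() if op_class in ops)


def peak_int_ops_per_cycle(op_class):
    """Upper bound on ops of one class per cycle: port count capped by issue width."""
    return min(len(ports_for(op_class)), ISSUE_WIDTH)


def min_vector_cycles(n_ops, lmul, ports=VEC_PORTS):
    """Lower bound on cycles for n general vector ops at integer LMUL,
    assuming each op holds one port for LMUL cycles (an assumption)."""
    if n_ops == 0:
        return 0
    return max(lmul, ceil(n_ops * lmul / ports))


print(peak_int_ops_per_cycle("int"))     # 4
print(peak_int_ops_per_cycle("branch"))  # 2
print(peak_int_ops_per_cycle("imul"))    # 1
print(min_vector_cycles(8, 4))           # 16
```

These are only bounds. Real throughput also depends on dependencies, the load/store ports and the front end, none of which are modelled here.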
 

camel-cdr

Junior Member
Feb 23, 2024
20
65
51
https://www.imaginationtech.com/products/cpu/apxm-6200/

1.png
Update from https://www.theregister.com/2024/04/08/imagination_riscv_cpu_cores/:

In SpecINT2k6, APXM-6200 cores apparently beat out the popular but aging Cortex-A53 core by 65 percent. Imagination also claims a 38 percent lead over the Cortex-A55 and a 14 percent advantage against the Cortex-A510.

As more recent Arm cores have boosted performance at the cost of area efficiency, the APXM-6200 boasts 2.8 times the density of the Cortex-A55 and 3.4 times that of the Cortex-A510

Imagination tells us that we can probably expect CPUs using APXM-6200 cores to arrive in the second half of next year, perhaps closer to the end of the year
 

FlameTail

Diamond Member
Dec 15, 2021
3,757
2,203
106
Has everybody forgotten about Rivos?


The legendary RISC-V chip startup. Rivos is to RISC-V what Nuvia was to ARM.
 

camel-cdr

Junior Member
Feb 23, 2024
20
65
51
yeah hard to get excited for A72 performance in 2025.
How so? You can't really compare in-order with out-of-order cores, they have completely different targets.
This is an established company testing the waters with their own RISC-V design that "supposedly" beats the currently fastest Arm in-order core, the A520, in terms of performance (A520 is supposed to be 8% faster than A510, and APXM-6200 is supposed to be 14% faster than A510).
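Spelled out as arithmetic, with both figures expressed relative to the A510, the comparison looks like this. It is only a sketch: it assumes the two vendor claims were measured on comparable workloads, which is not guaranteed, and the variable names are mine.

```python
# Claimed performance relative to Cortex-A510 (A510 = 1.00).
apxm_vs_a510 = 1.14   # Imagination's claim for APXM-6200
a520_vs_a510 = 1.08   # Arm's claim for Cortex-A520


def relative(a_vs_base, b_vs_base):
    """Performance of A relative to B when both are given against the same base."""
    return a_vs_base / b_vs_base


apxm_vs_a520 = relative(apxm_vs_a510, a520_vs_a510)
print(f"APXM-6200 vs A520: {(apxm_vs_a520 - 1) * 100:.1f}% faster")  # about 5.6%
```

So if both claims hold, the lead over the A520 would be around 5 to 6%, not the full 14%.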

BTW, OpenXiangShan merged vector support to master yesterday. Not that it fully works yet though: the simulation freezes in some of my benchmarks, so I'll have to investigate and write an issue.