RISC-V Latest Developments Discussion [No Politics]


DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
Some background on my experience with RISC-V...
Five years ago we were developing a CI/CD pipeline for an arm64 SoC in some cloud, and we added tests to execute the binaries there as well.
We actually used some real HW instances with an ARM server chip of that era; unfortunately the vendor quickly dumped us and exited the market, leaving us with a fair amount of frustration.
We shifted the work to QEMU, which in principle is as good as the actual chips, but the emulation was buggy and slow, and in the end we ended up with qemu-user-static docker images, which worked quite well for us. We were running the arm64 Ubuntu cloud images of the time before moving on to Docker multi-arch QEMU images.

Lately we have been approached by many vendors with upcoming RISC-V chips, and out of curiosity I revisited the topic above.
To my pleasant surprise, running RISC-V QEMU is smooth as butter. Emulation is fast, and images from Debian, Ubuntu and Fedora are available out of the box.
I was running Ubuntu cloud images problem-free. Granted, it was headless, but with the likes of Imagination Tech offering up their GPU IP for integration, I guess it is only a matter of time.
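If anyone wants to try the same qemu-user-static route for riscv64, here is a minimal sketch; the multiarch/qemu-user-static and riscv64/ubuntu image names are just the commonly published ones on Docker Hub, so substitute whatever your registry provides:

Code:
# Register qemu-user-static binfmt handlers on the host (one-off, needs --privileged)
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Run a riscv64 userspace container on an x86-64 host;
# uname should report riscv64 inside the container
docker run --rm --platform linux/riscv64 riscv64/ubuntu:22.04 uname -m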

What is even more interesting is that Yocto/OpenEmbedded already has a meta layer for RISC-V, and apparently T-Head has already got the kernel packages and manifest for Android 10 working on RISC-V.
Very impressive for an ISA in such a short span of time. What's more, I see active LLVM, GCC and kernel development happening.
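For context, the basic QEMU RISC-V machine has even landed in openembedded-core itself, with meta-riscv layering board support on top. A minimal smoke-test sketch (the branch name is an assumption; pick whichever Yocto release you are on):

Code:
# qemuriscv64 is a stock machine in openembedded-core, so a quick build needs nothing exotic
git clone -b scarthgap https://git.yoctoproject.org/poky
source poky/oe-init-build-env build
MACHINE=qemuriscv64 bitbake core-image-minimal
MACHINE=qemuriscv64 runqemu nographic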

From the latest conferences I saw this slide, and I can't help but think that they are eating somebody's lunch, starting from MCUs and moving up to application processors.
[Attached image: 1652093521458.png]

Based on many developments around the world, this trend seems to be accelerating.
Many high-profile national and multinational RISC-V projects (e.g. the EU's EPI) are popping up left and right.
Intel is now a premier member of RISC-V International, alongside the likes of Google, Alibaba, Huawei, etc.
NVIDIA, and it seems soon AMD, are using RISC-V in their GPUs. Xilinx, Infineon, Siemens, Microchip, ST, AD, Renesas, etc. already have products in the pipeline or launched.
It will only be a matter of time before these companies start replacing their proprietary architectures with something RISC-V based. Tool support (compilers, debuggers, OSes, etc.) is taken care of by the community.
Also interesting is that there are lots of performant RISC-V implementations on GitHub: the XuanTie C910 from T-Head/Alibaba, SweRV from WD, and many more.
The embedded industry has already replaced a ton of traditional MCUs with RISC-V ones. AI-tailored CPUs from Jim Keller's Tenstorrent also seem to be in the spotlight.

Most importantly, a bunch of specs got ratified at the end of last year, mainly accelerated by developments around the world. Interesting times.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
Google removing support for RISC-V in Android Common Kernel?

The key statement is here:

Android will continue to support RISC-V. Due to the rapid rate of iteration, we are not ready to provide a single supported image for all vendors. This particular series of patches removes RISC-V support from the Android Generic Kernel Image (GKI).

Looks like the rate at which RISC-V is evolving is too fast to maintain a stable GKI; this should have been pretty obvious from the very beginning.
Folks are creating new extensions and standards are getting ratified like there's no tomorrow; they seem to be in a rush to create something comprehensive.

But it is no big deal as long as Google continues to maintain the toolchains (i.e. Soong, Kati, Bazel) to natively support RISC-V; the GKI is the least problematic part. There is no GKI for x86 or armv7 either; they got ahead of themselves.
There are bigger things to do here, like ART and Zygote optimizations, to name a few.
 

DrMrLordX

Lifer
Apr 27, 2000
22,914
12,983
136
Folks are creating new extensions and standards are getting ratified like there's no tomorrow; they seem to be in a rush to create something comprehensive.
That's one of the problems with RISC-V: the lack of any standardization around extensions. Sure, it's great to be able to make your own private extensions for an individual implementation, such as a storage microcontroller that will only ever see use in a very specific application. But what about CPUs intended for "general" computing? Critics have been pointing out this problem since RISC-V became a subject of public discussion.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
Looks like the rate at which RISC-V is evolving is too fast to maintain a stable GKI; this should have been pretty obvious from the very beginning.
Folks are creating new extensions and standards are getting ratified like there's no tomorrow; they seem to be in a rush to create something comprehensive.
RISC-V has actually been slowing down since RVA22. Fewer and fewer new extensions and standards are being proposed and ratified; the rush of RISC-V extensions/standards is nowhere near what it was in 2019~2022.

The removal of the GKI riscv64 code is because Qualcomm's implementation is incompatible with mainline riscv64.

RVA24 is better than most x86-64 and AArch64 ISAs on the market, and these are the only new instructions slated to become mandatory:
"The following are new development options intended to become mandatory in RVA24U64:
• Zabha: Byte and Halfword Atomic Memory Operations
• Zacas: Compare-and-Swap
• Ziccamoc: Main memory regions with both the cacheability and coherence PMAs must provide AMOCASQ level PMA support.
• Zvbc: Vector carryless multiply.
• Zama16b: Misaligned loads, stores, and AMOs to main memory regions that do not cross a naturally aligned 16-byte boundary are atomic."

The only other specs of value are OS-A Common and OS-A Server, which were already built on RVA22 but will be finalized with RVA24.

For example: [attached image: c930.jpeg]
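Coming back to that mandatory list: whether a given cross toolchain already understands those extension names is easy to probe via the -march string it accepts. A hedged sketch, assuming a riscv64-linux-gnu GCC is installed (the toolchain name is an assumption, and extension support varies a lot by compiler version):

Code:
# Ask the compiler whether it accepts each extension in a -march string;
# a failure just means this toolchain doesn't accept that name/combination yet.
for ext in zabha zacas zvbc; do
    if echo 'int main(void){return 0;}' | \
       riscv64-linux-gnu-gcc -march=rv64gcv_${ext} -x c -c -o /dev/null - 2>/dev/null; then
        echo "${ext}: accepted"
    else
        echo "${ext}: not recognized"
    fi
done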
 

camel-cdr

Member
Feb 23, 2024
32
103
66
The removal of the GKI riscv64 code is because Qualcomm's implementation is incompatible with mainline riscv64
I don't think this is true, considering Qualcomm's involvement in things like the scalar efficiency SIG; here they even propose some 48-bit instructions: https://docs.google.com/spreadsheet...p9vVvVjS6Jz9vGWhwmsdbEOF3JBwUg/htmlview#gid=0

what is true is:

* nothing has changed with our work on Android/riscv64 support in AOSP

* we've stopped producing ACK/GKI builds for now

* until there is an official GKI kernel, we're working on
transitioning to a kernel that we -- the folks working on
Android/riscv64 -- maintain...

* ...but unfortunately the GKI changes went out before our changes are ready

note that the "non-GKI" kernel will still be to all intents and
purposes an ACK/GKI kernel (with the aim that Android/riscv64 devices
will use GKI kernels), but since maintenance of an officially
_labelled_ GKI kernel is more expensive, we're removing the sticker
for now.

In other news, the SpacemiT K1 is looking good so far: https://github.com/pigirons/cpufp?tab=readme-ov-file#spacemit-k18-x-spacemit-x60

The claimed 2x vector compute over the A55 seems to be true:


Code:
$ ./cpufp --thread_pool=[0] # Spacemit X60
Number Threads: 1
Thread Pool Binding: 0
---------------------------------------------------------------
| Instruction Set | Core Computation       | Peak Performance |
| ime             | vmadot(s32,s8,s8)      | 511.53 GOPS      |
| ime             | vmadotu(u32,u8,u8)     | 511.5 GOPS       |
| ime             | vmadotus(s32,u8,s8)    | 511.53 GOPS      |
| ime             | vmadotsu(s32,s8,u8)    | 511.51 GOPS      |
| ime             | vmadotslide(s32,s8,s8) | 511.51 GOPS      |
| vector          | vfmacc.vf(f16,f16,f16) | 66.722 GFLOPS    |
| vector          | vfmacc.vv(f16,f16,f16) | 63.936 GFLOPS    |
| vector          | vfmacc.vf(f32,f32,f32) | 33.36 GFLOPS     |
| vector          | vfmacc.vv(f32,f32,f32) | 31.968 GFLOPS    |
| vector          | vfmacc.vf(f64,f64,f64) | 16.679 GFLOPS    |
| vector          | vfmacc.vv(f64,f64,f64) | 15.985 GFLOPS    |
---------------------------------------------------------------
$ ./cpufp --thread_pool=[0] # Cortex-A55
Number Threads: 1
Thread Pool Binding: 0
----------------------------------------------------------------
| Instruction Set | Core Computation        | Peak Performance |
| asimd_dp        | dp4a.vs(s32,s8,s8)      | 58.305 GOPS      |
| asimd_dp        | dp4a.vv(s32,s8,s8)      | 58.311 GOPS      |
| asimd_dp        | dp4a.vs(u32,u8,u8)      | 58.313 GOPS      |
| asimd_dp        | dp4a.vv(u32,u8,u8)      | 58.311 GOPS      |
| asimd_hp        | fmla.vs(fp16,fp16,fp16) | 29.156 GFLOPS    |
| asimd_hp        | fmla.vv(fp16,fp16,fp16) | 29.156 GFLOPS    |
| asimd           | fmla.vs(f32,f32,f32)    | 14.579 GFLOPS    |
| asimd           | fmla.vv(f32,f32,f32)    | 14.577 GFLOPS    |
| asimd           | fmla.vs(f64,f64,f64)    | 7.2891 GFLOPS    |
| asimd           | fmla.vv(f64,f64,f64)    | 7.2834 GFLOPS    |

It's also roughly 10x the INT8 performance, using their custom matrix extension.
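For anyone wanting to reproduce this on their own board, the numbers above come from the pigirons/cpufp repo linked earlier; a rough sketch (build steps differ per target ISA, so follow the repo README rather than this):

Code:
# Fetch the benchmark used above and pin it to core 0, same as the runs shown
git clone https://github.com/pigirons/cpufp
cd cpufp
# ... build for your target ISA per the README ...
./cpufp --thread_pool=[0]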
 

camel-cdr

Member
Feb 23, 2024
32
103
66
Thanks a lot for sharing :)


What are the respective frequencies?

Also I guess the K1 is simulated while the A55 is a real platform. And I guess your loops don't depend on memory?
No, the K1 is real; you can order it on AliExpress now, but sadly not in Germany for now :-( See the Banana Pi BPI-F3.
I'm not sure about the frequency, but the Geekbench scores list it at 1.6 GHz.
 

camel-cdr

Member
Feb 23, 2024
32
103
66
On that topic, XiangShan's RVV backend still has some problems with my benchmarks, but there are a few other things I've noticed:
  • There is a new branch that separates the float and vector pipelines
  • New SPECint 2006 numbers have been published; they look quite good so far: [attached image: 1.png]
  • A new PR implements Zicond; it only took ~40 lines of code, and I found it quite interesting to look at.
  • Similarly, here is how the Zvbb (vector bit-manipulation) extension was implemented a while back; it took ~450 lines of code: functional unit changes, main repo changes
Other open-source RVV implementations also had some updates:
  • IntelLabs' darecreek implementation published a small design description: "So far, the arithmetic functional units are sufficiently tested. Other functions such as load/store and control flow only passed basic test", so it's still very much in progress. Hopefully it will soon be ready enough to attach to Rocket Chip.
  • t1 has now had a public beta release; you can play with it using "docker run --name t1 -it -v $PWD:/workspace --rm ghcr.io/chipsalliance/t1-blastoise:latest /bin/bash" and run a program with "ip-emulator --no-logging -C yourProgram". It currently uses Spike to execute the scalar instructions. Last time I tried it I couldn't get my benchmarks to run; I'll have to look into it again.
 


camel-cdr

Member
Feb 23, 2024
32
103
66
So, I just tried running the new OpenXiangShan backend again, and it seems to work except for vrgather.vv, so I've got some benchmarks against my 1600X desktop for y'all:


The benchmark:
  • The measurements are from the simdutf vectorized utf8 to utf16 conversion routines, using my PR for the RVV implementation.
  • Both vectorized versions assume valid input and do only bounds checks, because UTF-8 validation requires vrgather.vv in RVV and that currently doesn't work in XiangShan.
  • The XiangShan results are from the DefaultConfig.
  • The x86 results were averaged over multiple runs; the XiangShan result is a single sample, because it was obtained from Verilog simulation, which is incredibly slow.
  • The capitalized inputs are from the lipsum dataset, which contains lorem-ipsum-style text that is quite regular. The others are the source of Wikipedia entries in the respective languages and are closer to real-world data.
  • The numbers are in input bytes/cycle, so the bigger, the better. You can multiply the numbers by clock frequency to get approximately GB/s.

XiangShan   scalar (b/c)   RVV (b/c)    speedup
Latin       0.919203       1.218785     1.33x
Japanese    0.239199       0.532492     2.23x
Hebrew      0.148244       0.691389     4.66x
Korean      0.187919       0.504613     2.69x
Emoji       0.302343       0.324324     1.07x
german      0.596167       0.940519     1.58x
japanese    0.292013       0.624463     2.14x
arabic      0.243619       0.801790     3.29x

1600X       scalar (b/c)   AVX2 (b/c)   speedup
Latin       3.444410       5.196881     1.51x
Japanese    0.274903       1.132911     4.12x
Hebrew      0.186775       0.722549     3.87x
Korean      0.219586       0.700254     3.19x
Emoji       0.294633       0.459388     1.56x
german      0.686341       1.766784     2.57x
japanese    0.465766       0.879507     1.89x
arabic      0.394321       0.914913     2.32x

Note that this is very specific hand-vectorized code for both processors.
While the 1600X has AVX2 with 256 bits per register and XiangShan only has 128, keep in mind that RVV has some more expressive/feature-rich instructions.
vcompress in particular is interesting for this implementation, and the AVX-512 version does make use of its byte-compress instruction.
A small update on this: it turns out the performance characteristics change drastically when enabling DRAMsim3 in the simulation:

Code:
XiangShan master 2024-06-02 with DRAMsim3:
Latin    scalar: 0.414036 b/c rvv: 3.081664 b/c speedup:  7.442989x
Hebrew   scalar: 0.118154 b/c rvv: 1.003512 b/c speedup:  8.493226x
Japanese scalar: 0.108959 b/c rvv: 1.006036 b/c speedup:  9.233149x
Korean   scalar: 0.095818 b/c rvv: 1.009591 b/c speedup: 10.536598x
Emoji    scalar: 0.107764 b/c rvv: 1.002758 b/c speedup:  9.305089x


Ryzen-1600x:
Latin    scalar: 1.3779494 b/c avx2: 10.6266082 b/c speedup:  7.7118997x
Hebrew   scalar: 0.2273493 b/c avx2:  2.4073338 b/c speedup: 10.5887005x
Japanese scalar: 0.2694193 b/c avx2:  1.5752458 b/c speedup:  5.8468165x
Korean   scalar: 0.1994506 b/c avx2:  1.5092624 b/c speedup:  7.5670958x
Emoji    scalar: 0.3975124 b/c avx2:  0.4021987 b/c speedup:  1.0117890x

It still looks like it's cache/memory bound compared to Zen 1. For the "Latin" input the code basically just does a narrowing memcpy, so we should be completely memory bound there, and Zen 1 is 3x faster.
You should probably disregard the Emoji benchmark, because the input is just emojis, and the RVV code has a special case that disproportionately benefits from that.
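For anyone wanting to repeat the DRAMsim3 run, the gist of the build is below; the flag and variable names are from the XiangShan build docs as I remember them and may have changed, so treat this as a pointer rather than a recipe:

Code:
# Build the XiangShan Verilator emulator with DRAMsim3 as the memory model
# (WITH_DRAMSIM3 / DRAMSIM3_HOME are assumptions; check the current XiangShan README)
export NOOP_HOME=$PWD                     # XiangShan checkout
export DRAMSIM3_HOME=/path/to/DRAMsim3    # DRAMsim3 checkout with XiangShan's patches
make emu CONFIG=DefaultConfig WITH_DRAMSIM3=1 -j"$(nproc)"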

In other news: https://www.newelectronics.co.uk/co...ology-announces-new-soc-and-development-board
Andes Technology announces new SoC and development board [...]
The QiLai SoC chip includes a high-performance quad-core RISC-V AX45MP cluster and one NX27V vector processor [...]
It also contains an efficient scalar unit and an out-of-order Vector Processing Unit (VPU) with 512-bit vector length (VLEN) and 512-bit data path width (DLEN), capable of generating up to 4 512-bit results per cycle
The AX45MP isn't all that interesting, but the NX27V looks really powerful. Both are a bit older already, but it's nice that we'll have a devboard soon.

See this video for more detail on the architecture:

[Attached images: 1.png, 2.png]

Looks like we'll see a demo at the RISC-V Summit Europe at the end of this month:
A platform test-chip based on Andes RISC-V multiprocessor AX45MP and RVV vector processor, designed and manufactured in TSMC 7nm process, along with its evaluation board will be demonstrated
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
[Attached image: cuzco.jpg]
Andes announced the Cuzco core/series 10-ish days ago.
The minimum base clock rate of the AX65 is 2.4 GHz, with the AX66 coming out in the first half of 2025.

In 2020 they announced they were searching for customers in "HPC, PC, Mobile Phone, AI, and Cloud", and they found a customer for mobile phones for the AX60+/Cuzco series in 2021+. They have yet to announce who the partner is, but it could easily be number one: MediaTek. An AX66 + Nvidia iGPU + MediaTek SoC, for example, could fit into the Dimensity 6000 through Helio G90 series, which have launched with Cortex-A75 cores all the way up to Q1 2024.
 

eek2121

Diamond Member
Aug 2, 2005
3,415
5,054
136
Tenstorrent's dev kit with their RISC-V design is out. They are claiming >18 SPECint 2006/GHz.
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,036
136
Tenstorrent's dev kit with their RISC-V design is out. They are claiming >18 SPECint 2006/GHz.

Link? I'm seeing announcements of Wormhole but nothing to suggest it's using the large Ascalon configuration. (Their website is surprisingly devoid of details on what Wormhole actually contains.)

EDIT: It looks like Wormhole only uses RV for the Tensix cores. The control cores aren't RV at all, let alone Ascalon; they're ARC. So I'm curious where you saw that anything in Wormhole was >18SPECint06/GHz.
 

Nothingness

Diamond Member
Jul 3, 2013
3,301
2,373
136
Link? I'm seeing announcements of Wormhole but nothing to suggest it's using the large Ascalon configuration. (Their website is surprisingly devoid of details on what Wormhole actually contains.)

EDIT: It looks like Wormhole only uses RV for the Tensix cores. The control cores aren't RV at all, let alone Ascalon; they're ARC. So I'm curious where you saw that anything in Wormhole was >18SPECint06/GHz.
18 specint/ghz @500 MHz :p

Edit: if they really had a powerful R-V processor, they would not propose x86-based workstations to host their dev kit.
 

Nothingness

Diamond Member
Jul 3, 2013
3,301
2,373
136
Why are they using SPEC 2006 instead of the more relevant SPEC 2017?
1. Because they don't have money to pay the license
2. Because they want to be sure they can't be compared to modern cores.

Joking aside, I don't know. The only thing I'd add is that if I had to choose a (Western) company that could at last deliver a good R-V core, I'd put my bet on Tenstorrent.
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,036
136
1. Because they don't have money to pay the license
2. Because they want to be sure they can't be compared to modern cores.

Joking aside, I don't know. The only thing I'd add is that if I had to choose a (Western) company that could at last deliver a good R-V core, I'd put my bet on Tenstorrent.

I think I'd expect more from SiFive than Tenstorrent, but I continue to look at both with a certain skepticism, especially when it comes to being able to keep delivering reliably across generations.
 

Nothingness

Diamond Member
Jul 3, 2013
3,301
2,373
136
I think I'd expect more from SiFive than Tenstorrent, but I continue to look at both with a certain skepticism, especially when it comes to being able to keep delivering reliably across generations.
Hasn't SiFive dropped the ball, and aren't they now only doing custom designs for some customers?
 

SarahKerrigan

Senior member
Oct 12, 2014
735
2,036
136
Hasn't SiFive dropped the ball, and aren't they now only doing custom designs for some customers?

AFAIK that rumor was false. P870 was the most recent core and there's a successor for it roadmapped.

Going off the HC presentation, the P870 seems quite fierce, though there are some games being played in how they're measuring structure sizes.
 

Nothingness

Diamond Member
Jul 3, 2013
3,301
2,373
136
AFAIK that rumor was false. P870 was the most recent core and there's a successor for it roadmapped.

Going off the HC presentation, the P870 seems quite fierce, though there are some games being played in how they're measuring structure sizes.
That rumor might have been false, but all the people I knew who worked there have left (except for three).