News NVIDIA and MediaTek want to bring RTX graphics to ARM laptops


NTMBK

Lifer
Nov 14, 2011
10,474
5,886
136
What would an ARM-based laptop look like with an RTX graphics card? That's something NVIDIA is exploring together with MediaTek, a company best known for building ARM-based chips. Together, they're building a reference laptop platform that'll support Chromium, Linux and NVIDIA SDKs (software development kits). While it's unclear what, if anything, this partnership will lead to, it's not hard to get excited about the idea of a next-generation Chromebook that's light, energy efficient and equipped with NVIDIA's ray-tracing RTX hardware, even if they inevitably end up being stripped down.


I don't really get why MediaTek is involved here. NVIDIA can clearly produce its own ARM SoCs. And I'm not taking it seriously until it runs Windows.

BUT... I can definitely see a future where an ARM laptop chip can emulate x86 games faster than an x86 chip in the same power envelope. And at that point, this could be viable.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,606
3,290
136
Similar case to AMD's advertisements on RDNA 3 performance. It required special circumstances to achieve its advertised throughput, and those special cases were far from common. Same thing here for those sparse situations.

As for power draw, it would be interesting to see the utilization and frequencies of all the processing units when it's under load. My guess is that some portions of the chip aren't seeing optimum usage, or that delays somewhere are forcing various parts to stop and wait for data or instructions.
 

Aeonsim

Junior Member
May 10, 2020
19
53
91
Of course that wouldn't explain spontaneous reboots...

I've got access to a couple; so far the instability I've seen has been around memory management. NVIDIA's drivers don't seem to be particularly good at managing unified memory. I've had the device freeze solid or lag like crazy when a couple of different processes requested a mix of memory for GPU use and CPU use (loading gpt-oss:120B + web browser + another small GPU task). Rather than OOM-killing a process, the machine just seems to lock up. Secondly, there are official notes instructing you to manually tell the kernel to release OS cached memory (sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches') if your jobs fail to allocate enough memory even though your memory request is well under the max the system supports.
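
Before dropping caches, it can help to check how much memory the kernel is actually holding as page cache. The sketch below is only an illustration, not something from NVIDIA's notes: it parses the standard Linux /proc/meminfo fields MemFree, Cached and MemAvailable. The function names are made up for this example, and whether these counters fully reflect GPU-side allocations on a unified-memory device is an open assumption.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}.

    Most values are in kiB; a few fields (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def reclaimable_summary(fields: dict[str, int]) -> dict[str, float]:
    """Return free, page-cache and available memory in GiB."""
    kib_per_gib = 1024 * 1024
    return {
        "free_gib": fields.get("MemFree", 0) / kib_per_gib,
        "page_cache_gib": fields.get("Cached", 0) / kib_per_gib,
        "available_gib": fields.get("MemAvailable", 0) / kib_per_gib,
    }


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # Linux only
        summary = reclaimable_summary(parse_meminfo(meminfo.read_text()))
        for key, value in summary.items():
            print(f"{key}: {value:.1f}")
```

A large page_cache_gib next to a small free_gib would be consistent with the situation the notes describe, where an allocation fails until the cache is released.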

They've also posted that the official TDP of the SoC is supposed to be 140 W (possibly with full CPU and GPU load); the additional 100 W of the 240 W total is for the NIC, SSD and USB-C.
 

Aeonsim

Junior Member
May 10, 2020
19
53
91
Also a few other observations:
  • The ASUS version of the system also seems to have a pretty much silent fan profile.
  • I personally also had issues with Docker not working initially; I had to wipe a cache file to get it up and running straight after first boot.
  • htop reports peak frequencies of ~4.2 GHz with heavy SIMD code, though the CPU itself reports a max frequency of 3.9 GHz for the X925 cores.
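
One way to cross-check the frequency mismatch in the last bullet is to read the kernel's cpufreq files directly. This is a rough sketch that assumes the device exposes the usual Linux cpufreq sysfs files (scaling_cur_freq and cpuinfo_max_freq, both in kHz), which has not been confirmed in this thread. The function names are hypothetical.

```python
from pathlib import Path


def read_cpu_freqs(sysfs_root: str = "/sys/devices/system/cpu") -> dict[str, dict[str, float]]:
    """Return {cpuN: {"cur_ghz": ..., "max_ghz": ...}} from cpufreq sysfs files."""
    result = {}
    for cpufreq in sorted(Path(sysfs_root).glob("cpu[0-9]*/cpufreq")):
        entry = {}
        for key, filename in (("cur_ghz", "scaling_cur_freq"),
                              ("max_ghz", "cpuinfo_max_freq")):
            path = cpufreq / filename
            if path.is_file():
                # sysfs reports kHz; convert to GHz
                entry[key] = int(path.read_text().strip()) / 1_000_000
        if entry:
            result[cpufreq.parent.name] = entry
    return result


def cores_above_max(freqs: dict[str, dict[str, float]]) -> list[str]:
    """List cores whose current frequency exceeds their reported maximum."""
    return [cpu for cpu, f in freqs.items()
            if "cur_ghz" in f and "max_ghz" in f and f["cur_ghz"] > f["max_ghz"]]


if __name__ == "__main__":
    freqs = read_cpu_freqs()
    for cpu, f in freqs.items():
        print(cpu, f)
    print("above reported max:", cores_above_max(freqs))
```

If cores show up in the second list while the SIMD load is running, the 4.2 GHz reading is at least coming from the kernel rather than from htop alone.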
 

poke01

Diamond Member
Mar 8, 2022
4,469
5,792
106
Thanks for the report. What's the package power of the CPU under your workflow?
 

Aeonsim

Junior Member
May 10, 2020
19
53
91
Not sure, I've not found a good way of measuring power on the device yet.

STREAM reports memory bandwidth of 120-170 GB/s when running on all cores.

tinymembench gives 89 GB/s for a single core with NEON and the following latencies.

Code:
Latency for a x925
block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.0 ns          /     0.0 ns
    131072 :    0.9 ns          /     1.3 ns
    262144 :    1.3 ns          /     1.6 ns
    524288 :    1.6 ns          /     1.7 ns
   1048576 :    1.7 ns          /     1.7 ns
   2097152 :    1.9 ns          /     1.9 ns
   4194304 :    8.9 ns          /    12.5 ns
   8388608 :   12.3 ns          /    15.4 ns
  16777216 :   14.3 ns          /    16.8 ns
  33554432 :   60.3 ns          /    85.1 ns
  67108864 :   85.9 ns          /   105.9 ns
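
The table above can be scanned for the block sizes at which latency steps up sharply. The sketch below is one way to do that; the 2x ratio and 0.5 ns floor are arbitrary thresholds chosen for this example, and reading the steps as cache-level boundaries is an inference, not something tinymembench reports.

```python
import re

# Latency rows copied from the tinymembench output in the post.
TABLE = """\
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.0 ns          /     0.0 ns
    131072 :    0.9 ns          /     1.3 ns
    262144 :    1.3 ns          /     1.6 ns
    524288 :    1.6 ns          /     1.7 ns
   1048576 :    1.7 ns          /     1.7 ns
   2097152 :    1.9 ns          /     1.9 ns
   4194304 :    8.9 ns          /    12.5 ns
   8388608 :   12.3 ns          /    15.4 ns
  16777216 :   14.3 ns          /    16.8 ns
  33554432 :   60.3 ns          /    85.1 ns
  67108864 :   85.9 ns          /   105.9 ns
"""

ROW = re.compile(r"^\s*(\d+)\s*:\s*([\d.]+)\s*ns\s*/\s*([\d.]+)\s*ns\s*$")


def parse_rows(text: str) -> list[tuple[int, float, float]]:
    """Return (block_size_bytes, single_read_ns, dual_read_ns) per table row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((int(match.group(1)),
                         float(match.group(2)),
                         float(match.group(3))))
    return rows


def latency_steps(rows, ratio: float = 2.0, floor_ns: float = 0.5) -> list[int]:
    """Block sizes where single-read latency is >= ratio x the previous row.

    floor_ns ignores the all-zero rows at the top of the table.
    """
    steps = []
    for (_, prev_ns, _), (size, cur_ns, _) in zip(rows, rows[1:]):
        if cur_ns >= max(floor_ns, ratio * prev_ns):
            steps.append(size)
    return steps


if __name__ == "__main__":
    for size in latency_steps(parse_rows(TABLE)):
        print(f"{size // 1024} KiB")
```

With these thresholds the steps land at 128 KiB, 4 MiB and 32 MiB, which is where the numbers above visibly jump.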
 

poke01

Diamond Member
Mar 8, 2022
4,469
5,792
106
Thank you. Can you also run this bench? All threads is fine.

 

Aeonsim

Junior Member
May 10, 2020
19
53
91
That benchmark is fairly broken; it only compiles without edits on my Mac at the moment. With a few edits I managed to get it running, but it's not doing a particularly good job of detecting things accurately (e.g. LPDDR4 3200 instead of LPDDR5 8533). So I'm not sure how accurate any of its values will be.

Code:
./memory_bandwidth
# System Information

- **CPU:** ARM Processor (implementer: 0x65, part: 0x3461) ✓
- **Total RAM:** 119 GB ✓
- **Available RAM:** 109 GB ✓
- **Physical CPU Cores:** 20 ✓
- **Logical CPU Threads:** 20 ✓

## Memory Specifications

- **Architecture:** ARM64 Architecture
- **Type:** LPDDR4 ✓
- **Speed:** 3200 MT/s ✓
- **Data Width:** 64 bits
- **Total Width:** 64 bits
- **Channels:** 2 (not detected from system)
- **Theoretical Bandwidth:** 51.2 GB/s (409.6 Gb/s) ✓

## Cache Information

- **L1 Data Cache:** 64 KB per core ✓
- **L1 Instruction Cache:** 64 KB per core ✓
- **L2 Cache:** 2048 KB per core ✓
- **L3 Cache:** 8 MB shared ✓
- **Cache Line Size:** 64 bytes ✓


=== LARGE MEMORY MODE ===
Testing with large working sets (>4GB) - Natural system performance
No cache interference - let hardware prefetchers and memory controllers work naturally

## Test Results

| Test | Working Set | Threads | Bandwidth (Gb/s) | Latency (ns) | Efficiency (%) |
|------|-------------|---------|------------------|--------------|----------------|
| Sequential Read | 6GB | 20 | 806.86 ⚠️  | 0.6 | 197.0 |
| Sequential Write | 6GB | 20 | 656.67 ⚠️  | 0.8 | 160.3 |
| Random Read | 6GB | 20 | 129.59 | 4.0 | 31.6 |
| Random Write | 6GB | 20 | 545.32 ⚠️  | 0.9 | 133.1 |
| Copy | 6GB | 20 | 966.19 ⚠️  | 0.5 | 235.9 |
| Triad | 6GB | 20 | 945.88 ⚠️  | 0.5 | 230.9 |
| Matrix Multiply (GEMM) | 6GB | 20 | 4.06 | 126.1 | 1.0 |

## Test Complete

All memory bandwidth tests have been completed successfully.
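
The efficiency figures above 100% follow directly from the misdetected memory type. The sketch below redoes the arithmetic: peak bandwidth is the transfer rate times the bus width in bytes, and the table's bandwidth column is in gigabits per second. The 256-bit bus width used for the LPDDR5 8533 case is an assumption, since it is not stated anywhere in this thread, and the function names are made up for the example.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, bus_width_bits: int) -> float:
    """Peak DRAM bandwidth in GB/s: transfers per second x bytes per transfer."""
    return transfer_rate_mt_s * 1e6 * (bus_width_bits / 8) / 1e9


def efficiency_pct(measured_gbit_s: float, peak_gb_s: float) -> float:
    """Measured bandwidth (gigabits/s, as in the table) as a percentage of peak (GB/s)."""
    return measured_gbit_s / 8 / peak_gb_s * 100


# What the tool detected: LPDDR4 at 3200 MT/s, 2 channels x 64 bits.
detected_peak = peak_bandwidth_gb_s(3200, 2 * 64)

# ASSUMPTION: 8533 MT/s on a 256-bit bus (bus width not given in the thread).
assumed_peak = peak_bandwidth_gb_s(8533, 256)

sequential_read_gbit_s = 806.86  # "Sequential Read" row from the table

if __name__ == "__main__":
    print(f"detected peak: {detected_peak:.1f} GB/s")
    print(f"assumed peak:  {assumed_peak:.1f} GB/s")
    print(f"seq read vs detected peak: {efficiency_pct(sequential_read_gbit_s, detected_peak):.1f}%")
    print(f"seq read vs assumed peak:  {efficiency_pct(sequential_read_gbit_s, assumed_peak):.1f}%")
```

Against the detected 51.2 GB/s peak, the sequential read works out to the same 197.0% the tool printed. Against the assumed 8533 MT/s configuration it is roughly 37%, so the warnings look like an artifact of the detection rather than of the measurement.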
 

poke01

Diamond Member
Mar 8, 2022
4,469
5,792
106
It's broken on Mac too; it reports that my M4 Mac has 32 channels lol. But thank you for running it.