AMD Announces Full Support For Llama 3.1 AI Models Across EPYC CPUs, Instinct Accelerators, Ryzen AI NPUs & Radeon GPUs


coercitiv

Diamond Member
Jan 24, 2014
7,118
16,475
136
A 14B model works perfectly fine with my 6600XT 8GB in LM Studio. AMD is under-promoting itself.
It runs, but not perfectly fine. You need partial GPU offload to make those models work, since they use more than 8GB of VRAM. That cuts inference speed, and with the new trend of "thinking" models, inference speed matters more than ever, as the LLM first has to generate quite a lot of tokens before it's ready to formulate an answer.

It might have been a better idea to label these as "Recommended" instead of "Max Supported", since not everyone will read the footnotes. Still, it's a good decision to point people towards the largest parameter count that fits in VRAM: it leads to a better first user experience and highlights the difference between cards.
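For anyone curious what the offload split actually controls, here's a minimal sketch using llama-cpp-python, the same llama.cpp knob LM Studio exposes as "GPU offload" (the model file name and the layer count below are placeholders, not my actual setup):

```python
# Minimal partial-offload sketch with llama-cpp-python (assumed installed).
# File name and layer split are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=28,  # put 28 layers in VRAM; the remaining layers run on the CPU
    n_ctx=4096,       # context window; larger values also increase VRAM use
)

out = llm("Explain partial GPU offload in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```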
 

JustViewing

Senior member
Aug 17, 2022
267
470
106
It runs, but not perfectly fine. You need partial GPU offload to make those models work, since they use more than 8GB of VRAM. That cuts inference speed, and with the new trend of "thinking" models, inference speed matters more than ever, as the LLM first has to generate quite a lot of tokens before it's ready to formulate an answer.

It might have been a better idea to label these as "Recommended" instead of "Max Supported", since not everyone will read the footnotes. Still, it's a good decision to point people towards the largest parameter count that fits in VRAM: it leads to a better first user experience and highlights the difference between cards.
Maybe, but still good enough. It generates text faster than normal reading speed.
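Rough back-of-the-envelope, assuming a typical silent reading speed of about 250 words per minute and about 1.3 tokens per word (both just rules of thumb, not measured here):

```python
# Compare generation speed with a typical reading speed (assumed figures).
reading_wpm = 250        # assumed typical silent reading speed
tokens_per_word = 1.3    # common rule of thumb for English text

reading_tok_per_s = reading_wpm * tokens_per_word / 60   # ~5.4 tok/s
for gen_speed in (7, 20, 50):  # generation speeds mentioned in this thread
    print(f"{gen_speed} tok/s is {gen_speed / reading_tok_per_s:.1f}x reading speed")
```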
 
  • Like
Reactions: igor_kavinski

coercitiv

Diamond Member
Jan 24, 2014
7,118
16,475
136
Maybe, but still good enough. It generates text faster than normal reading speed.
I disagree. Moving from full (32/32) to partial (28/32) GPU offload on my 6800XT with a 12700K and DDR4-3600 cuts speed from 50 tokens/s to just 20. In my experience this is fine for previous-gen LLMs, but not for the new-gen ones.

Here's how an R1 distill formulates an answer:
[screenshot: R1 distill answer in LM Studio, showing the expanded "thoughts" section]

Notice the "thoughts" section, which takes 15s to complete in this case at 50 tok/s. At 20 tok/s it would take 35s or more. Writing speed is still good, but the delay before the first word has grown by more than 20 seconds on top of the original 15.

Obviously not all answers take this much "thinking"; some are quite quick.
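Making the arithmetic explicit (the token count is back-calculated from the timings above, not measured directly):

```python
# How "thinking" tokens delay the first visible word of the answer.
full_speed = 50      # tok/s at full 32/32 offload
partial_speed = 20   # tok/s at partial 28/32 offload
thinking_seconds_at_full = 15

thinking_tokens = thinking_seconds_at_full * full_speed   # ~750 hidden tokens
delay_at_partial = thinking_tokens / partial_speed        # 37.5 s before the answer starts

print(f"~{thinking_tokens} thinking tokens")
print(f"Delay at partial offload: {delay_at_partial:.1f}s "
      f"(+{delay_at_partial - thinking_seconds_at_full:.1f}s vs. full offload)")
```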
 
  • Like
Reactions: igor_kavinski

JustViewing

Senior member
Aug 17, 2022
267
470
106
I disagree. Moving from full (32/32) to partial (28/32) GPU offload on my 6800XT with a 12700K and DDR4-3600 cuts speed from 50 tokens/s to just 20. In my experience this is fine for previous-gen LLMs, but not for the new-gen ones.

Here's how an R1 distill formulates an answer:

Notice the "thoughts" section, which takes 15s to complete in this case at 50 tok/s. At 20 tok/s it would take 35s or more. Writing speed is still good, but the delay before the first word has grown by more than 20 seconds on top of the original 15.

Obviously not all answers take this much "thinking"; some are quite quick.
I don't disagree with your statement. However, I guess what we consider acceptable speed is different :) . For the 8B model I am getting 40 tokens/s; for the 14B, 7 tokens/s.

The thing I love most about these models is not the actual answer but the thought process. It is very insightful.
 

JustViewing

Senior member
Aug 17, 2022
267
470
106
A Ryzen 5950X does 7 tokens/s for the 8B model and 4 tokens/s for the 14B. LM Studio seems to limit the CPU thread count to 16; as I remember, I was able to set 32 earlier.
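If the UI won't go past 16, the underlying llama.cpp thread setting can still be set directly in code; a rough sketch with llama-cpp-python (the model file name is a placeholder, and whether 32 threads actually beats 16 on a 5950X is workload-dependent):

```python
# CPU-only run with an explicit thread count via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=0,   # keep everything on the CPU
    n_threads=32,     # all logical threads on a 5950X; 16 (physical cores) may be as fast
)
```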
 
  • Like
Reactions: Makaveli

coercitiv

Diamond Member
Jan 24, 2014
7,118
16,475
136
Does the fan ramp up noticeably when it's "thinking"?
I never had the fan noticeably ramp up during normal use, but my usage is sparse and bursty. However, with some models I had noticeable coil whine, which never happens in gaming. This card has some type of coil whine while the system idles in UEFI, but none during normal use.

Here's how it looks at stock with a prompt that made it "think" a bit longer; the fan ramped up to 1100 RPM after the screenshot:
[screenshot: inference run at stock GPU clocks]

Here's my usual gaming underclock & undervolt:
[screenshot: inference run with the gaming underclock & undervolt]

At first sight the higher stock clocks don't seem to help inference speed much, though this is the first time I've looked at it. I would expect memory speed to help, obviously. Maybe also latency? Reminds me of crypto mining :D
 
  • Like
Reactions: igor_kavinski
Jul 27, 2020
24,114
16,826
146
Does LM Studio allow RAG (giving the chatbot one or more documents and making it answer questions about a specific document or category of documents)?
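Even if there's nothing built in, I guess a crude version could be wired up against LM Studio's local OpenAI-compatible server (recent builds expose one, typically at http://localhost:1234/v1; the model name and file below are just placeholders). Real RAG would add an embedding/vector-search step to pick relevant chunks instead of stuffing the whole document into the prompt:

```python
# "RAG by hand": ground the chat on a local document via LM Studio's
# OpenAI-compatible server (base URL, model name and file are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("my_doc.txt", encoding="utf-8") as f:
    document = f.read()

reply = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-14b",  # whichever model is loaded locally
    messages=[
        {"role": "system",
         "content": "Answer only from the document below. If the answer is not "
                    "in it, say so.\n\nDOCUMENT:\n" + document},
        {"role": "user", "content": "What does the document say about X?"},
    ],
)
print(reply.choices[0].message.content)
```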
 

JustViewing

Senior member
Aug 17, 2022
267
470
106
I think future models should separate the "intelligence" and "knowledge" parts rather than keep increasing model sizes. The knowledge part could be terabytes in size and shouldn't require training, only processing into a machine-optimized format/database. Something like brain + library/internet, or RAG on a grand scale. Home users could use a smaller knowledge base.
 
  • Like
Reactions: moinmoin and marees
Jul 27, 2020
24,114
16,826
146
I was getting failure messages with Deepseek Coder Lite. I unloaded it, clicked to reload it in the top bar, and it showed me the settings. Maxed everything out and I really love the result!

[screenshot: Deepseek Coder Lite reloaded with maxed-out settings]

With that much RAM used for useful results, I don't think any GPU will be of much use until maybe 5 years from now?

@Red Squirrel maybe now I can create that Xitter clone before you! :p
 

JustViewing

Senior member
Aug 17, 2022
267
470
106
I was getting failure messages with Deepseek Coder Lite. I unloaded it, clicked to reload it in the top bar, and it showed me the settings. Maxed everything out and I really love the result!

[screenshot: Deepseek Coder Lite reloaded with maxed-out settings]

With that much RAM used for useful results, I don't think any GPU will be of much use until maybe 5 years from now?

@Red Squirrel maybe now I can create that Xitter clone before you! :p
What is the size of the model you are using? With 384GB, you could try large models.
 
Jul 27, 2020
24,114
16,826
146
What is the size of the model you are using? With 384GB, you could try large models.
Deepseek Coder Lite is 10GB, which is almost useless at default settings. Max everything out and it balloons to 50+ GB of memory consumption, but then it understands and thinks better and gives me exactly what I ask for. I think I'm fairly satisfied at the moment. Will try to be more adventurous once I hit a bump or something.
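A plausible explanation for most of that jump is the KV cache, which grows linearly with the context length you max out. Rough estimate sketch (the layer/head numbers are illustrative placeholders, not Deepseek Coder Lite's real configuration):

```python
# Rough KV-cache size estimate: why memory balloons when context is maxed out.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x for keys and values, one entry per layer per cached token (fp16 assumed)
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

GIB = 1024 ** 3
for ctx in (4_096, 32_768, 131_072):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=ctx)
    print(f"context {ctx:>7}: ~{size / GIB:.1f} GiB of KV cache")
```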