AMD Announces Full Support For Llama 3.1 AI Models Across EPYC CPUs, Instinct Accelerators, Ryzen AI NPUs & Radeon GPUs


coercitiv

Diamond Member
Jan 24, 2014
7,118
16,475
136
A 14B model works perfectly fine with my 6600XT 8GB in LM Studio. AMD is under-promoting itself.
It runs, but not perfectly fine. You need partial GPU offload to make those models work, since they use more than 8GB of VRAM. That cuts inference speed, and with the new trend of "thinking" models, inference speed matters more than ever, as the LLM first has to generate quite a lot of tokens before it's ready to formulate an answer.

It might have been a better idea to label these as "Recommended" instead of "Max Supported", since not everyone will read the footnotes. Still, it's a good decision to point people towards the largest parameter count that fits in VRAM: it leads to a better first user experience and highlights the difference between cards.
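For anyone curious what the offload split actually controls, here's a minimal sketch using llama-cpp-python, the same llama.cpp knob LM Studio exposes as "GPU offload" (the model file name and the layer count below are placeholders, not my actual setup):

```python
# Minimal partial-offload sketch with llama-cpp-python (assumed installed).
# File name and layer split are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=28,  # put 28 layers in VRAM; the remaining layers run on the CPU
    n_ctx=4096,       # context window; larger values also increase VRAM use
)

out = llm("Explain partial GPU offload in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```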
 

JustViewing

Senior member
Aug 17, 2022
267
470
106
It runs, but not perfectly fine. You need partial GPU offload to make those models work, since they use more than 8GB of VRAM. That cuts inference speed, and with the new trend of "thinking" models, inference speed matters more than ever, as the LLM first has to generate quite a lot of tokens before it's ready to formulate an answer.

It might have been a better idea to label these as "Recommended" instead of "Max Supported", since not everyone will read the footnotes. Still, it's a good decision to point people towards the largest parameter count that fits in VRAM: it leads to a better first user experience and highlights the difference between cards.
Maybe, but still good enough. It generates text faster than normal reading speed.
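Rough back-of-the-envelope, assuming a typical silent reading speed of about 250 words per minute and about 1.3 tokens per word (both just rules of thumb, not measured here):

```python
# Compare generation speed with a typical reading speed (assumed figures).
reading_wpm = 250        # assumed typical silent reading speed
tokens_per_word = 1.3    # common rule of thumb for English text

reading_tok_per_s = reading_wpm * tokens_per_word / 60   # ~5.4 tok/s
for gen_speed in (7, 20, 50):  # generation speeds mentioned in this thread
    print(f"{gen_speed} tok/s is {gen_speed / reading_tok_per_s:.1f}x reading speed")
```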
 
  • Like
Reactions: igor_kavinski

coercitiv

Diamond Member
Jan 24, 2014
7,118
16,475
136
Maybe, but still good enough. It generates text faster than normal reading speed.
I disagree. Moving from full (32/32) to partial (28/32) GPU offload on my 6800XT with a 12700K and DDR4-3600 cuts speed from 50 tokens/s to just 20. In my experience this is fine for previous-gen LLMs, but not for the new-gen ones.

Here's how an R1 distill formulates an answer:
[screenshot: R1 distill answer in LM Studio, showing the expanded "thoughts" section]

Notice the "thoughts" section, which takes 15s to complete in this case at 50 tok/s. At 20 tok/s it would take 35s or more. Writing speed is still good, but the delay before the first word has grown by more than 20 seconds on top of the original 15.

Obviously not all answers take this much "thinking"; some are quite quick.
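Making the arithmetic explicit (the token count is back-calculated from the timings above, not measured directly):

```python
# How "thinking" tokens delay the first visible word of the answer.
full_speed = 50      # tok/s at full 32/32 offload
partial_speed = 20   # tok/s at partial 28/32 offload
thinking_seconds_at_full = 15

thinking_tokens = thinking_seconds_at_full * full_speed   # ~750 hidden tokens
delay_at_partial = thinking_tokens / partial_speed        # 37.5 s before the answer starts

print(f"~{thinking_tokens} thinking tokens")
print(f"Delay at partial offload: {delay_at_partial:.1f}s "
      f"(+{delay_at_partial - thinking_seconds_at_full:.1f}s vs. full offload)")
```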
 
  • Like
Reactions: igor_kavinski

JustViewing

Senior member
Aug 17, 2022
267
470
106
I disagree. Moving from full (32/32) to partial (28/32) GPU offload on my 6800XT with a 12700K and DDR4-3600 cuts speed from 50 tokens/s to just 20. In my experience this is fine for previous-gen LLMs, but not for the new-gen ones.

Here's how an R1 distill formulates an answer:

Notice the "thoughts" section, which takes 15s to complete in this case at 50 tok/s. At 20 tok/s it would take 35s or more. Writing speed is still good, but the delay before the first word has grown by more than 20 seconds on top of the original 15.

Obviously not all answers take this much "thinking"; some are quite quick.
I don't disagree with your statement. However, I guess what we consider acceptable speed is different :) . For the 8B model I am getting 40 tokens/s; for the 14B, 7 tokens/s.

The thing I love most about these models is not the actual answer but the thought process. It is very insightful.
 

JustViewing

Senior member
Aug 17, 2022
267
470
106
A Ryzen 5950X does 7 tokens/s for the 8B model and 4 tokens/s for the 14B. LM Studio seems to limit the CPU thread count to 16; as I remember, I was able to set 32 earlier.
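If the UI won't go past 16, the underlying llama.cpp thread setting can still be set directly in code; a rough sketch with llama-cpp-python (the model file name is a placeholder, and whether 32 threads actually beats 16 on a 5950X is workload-dependent):

```python
# CPU-only run with an explicit thread count via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=0,   # keep everything on the CPU
    n_threads=32,     # all logical threads on a 5950X; 16 (physical cores) may be as fast
)
```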
 
  • Like
Reactions: Makaveli

coercitiv

Diamond Member
Jan 24, 2014
7,118
16,475
136
Does the fan ramp up noticeably when it's "thinking"?
I never had the fan noticeably ramp up during normal use, but my usage is sparse and bursty. However, with some models I had noticeable coil whine, which never happens in gaming. This card has some type of coil whine while the system idles in UEFI, but none during normal use.

Here's how it looks at stock with a prompt that made it "think" a bit longer; the fan ramped up to 1100 RPM after the screenshot:
[screenshot: inference run at stock GPU clocks]

Here's my usual gaming underclock & undervolt:
[screenshot: inference run with the gaming underclock & undervolt]

At first sight the higher stock clocks don't seem to help inference speed much, though this is the first time I've looked at it. I would expect memory speed to help, obviously. Maybe also latency? Reminds me of crypto mining :D
 
  • Like
Reactions: igor_kavinski
Jul 27, 2020
24,114
16,826
146
Does LM Studio allow RAG (giving the chatbot one or more documents and making it answer questions about a specific document or category of documents)?
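Even if there's nothing built in, I guess a crude version could be wired up against LM Studio's local OpenAI-compatible server (recent builds expose one, typically at http://localhost:1234/v1; the model name and file below are just placeholders). Real RAG would add an embedding/vector-search step to pick relevant chunks instead of stuffing the whole document into the prompt:

```python
# "RAG by hand": ground the chat on a local document via LM Studio's
# OpenAI-compatible server (base URL, model name and file are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("my_doc.txt", encoding="utf-8") as f:
    document = f.read()

reply = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-14b",  # whichever model is loaded locally
    messages=[
        {"role": "system",
         "content": "Answer only from the document below. If the answer is not "
                    "in it, say so.\n\nDOCUMENT:\n" + document},
        {"role": "user", "content": "What does the document say about X?"},
    ],
)
print(reply.choices[0].message.content)
```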
 

JustViewing

Senior member
Aug 17, 2022
267
470
106
I think future models should separate the "intelligence" and "knowledge" parts rather than keep increasing model sizes. The knowledge part could be terabytes in size and shouldn't require training, only processing into a machine-optimized format/database. Something like brain + library/internet, or RAG on a grand scale. Home users could use a smaller knowledge base.
 
  • Like
Reactions: moinmoin and marees
Jul 27, 2020
24,114
16,826
146
I was getting failure messages with Deepseek Coder Lite. I unloaded it, clicked to reload it in the top bar, and it showed me the settings. Maxed everything out and I really love the result!

[screenshot: Deepseek Coder Lite reloaded with maxed-out settings]

With that much RAM used for useful results, I don't think any GPU will be of much use until maybe 5 years from now?

@Red Squirrel maybe now I can create that Xitter clone before you! :p
 

JustViewing

Senior member
Aug 17, 2022
267
470
106
I was getting failure messages with Deepseek Coder Lite. I unloaded it, clicked to reload it in the top bar, and it showed me the settings. Maxed everything out and I really love the result!

[screenshot: Deepseek Coder Lite reloaded with maxed-out settings]

With that much RAM used for useful results, I don't think any GPU will be of much use until maybe 5 years from now?

@Red Squirrel maybe now I can create that Xitter clone before you! :p
What is the size of the model you are using? With 384GB, you could try large models.
 
Jul 27, 2020
24,114
16,826
146
What is the size of the model you are using? With 384GB, you could try large models.
Deepseek Coder Lite is 10GB, which is almost useless at default settings. Max everything out and it balloons to 50+ GB of memory consumption, but then it understands and thinks better and gives me exactly what I ask for. I think I'm fairly satisfied at the moment. Will try to be more adventurous once I hit a bump or something.
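A plausible explanation for most of that jump is the KV cache, which grows linearly with the context length you max out. Rough estimate sketch (the layer/head numbers are illustrative placeholders, not Deepseek Coder Lite's real configuration):

```python
# Rough KV-cache size estimate: why memory balloons when context is maxed out.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x for keys and values, one entry per layer per cached token (fp16 assumed)
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

GIB = 1024 ** 3
for ctx in (4_096, 32_768, 131_072):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=ctx)
    print(f"context {ctx:>7}: ~{size / GIB:.1f} GiB of KV cache")
```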