peak GFLOPs and sustained GFLOPs.

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
What are some CPUs with floating point units (any released to date), if any, that had sustained performance of 90% or more of their peak performance?

Wasn't the SH-4 used in the Dreamcast rather low (~64%)?

Why do console architectures tend to have a larger gap between sustained and peak performance? Is it to challenge the programmers more?

The SH-4 could've and should've used much lower latency cache, been clocked at 400 MHz, had a 256-bit data bus at 125 MHz, and had 32 MB of system RAM to work with. I know that many Dreamcasts had trouble with overheating, but Sega was way too conservative with the clock speeds. It was still more efficient than the PS2, though, and pretty much traded blows with it (other than that the Dreamcast couldn't do more than a 16-bit color buffer, but the dithering artifacts were really only apparent in progressive scan).

The PS2 was really weak because it had no built-in texture hardware and vector units 0 and 1 were clocked lower than the rest of the CPU. On the upside, the PS2 has been emulated much more easily than the Dreamcast has been.

Sorry for the silly thread, but I hope someone here will be okay with it.
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,691
136
I was always under the impression that games (especially older ones from the DC and PS2 era) were very integer-heavy and didn't really need extremely potent FP units. Even today most games push the integer side of the code, stressing the branch prediction units (while load/store ops make up basically half of the executed ops in the code).
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
To calculate peak performance, you basically go around and add up all the FP units on the chip and generate a hypothetical situation where all of them just happened to be working on stuff at the same time.

But sustained is a different question when you start looking at real workloads. It suddenly shifts into a story of how well the CPU architecture keeps those FP units fed. Technically sustained GFLOPs can be misleading as well. I'm guessing now, but I think most "sustained" benchmarks come from some linpack/dgemm/sgemm type of trace. I think those have very few branch/load/store instructions and are full of a tight loop of nearly identical FP instructions. In those cases, a weaker architecture has an easier time keeping the FP units fed, since you don't need a strong branch predictor or very wide memory bandwidth. Technically it IS a valid workload and so it is a "realistic workload". However, some HPC traces don't look or act like linpack.
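To make the "add up all the FP units" part concrete, here's a toy sketch. All the numbers (2 FMA pipes, 2 FLOPs per pipe per cycle, 3.0 GHz, the 7.8 GFLOP/s "measured" figure) are made up for illustration, not from any real chip:

```python
# Toy model: peak GFLOP/s is just unit count x FLOPs/cycle x clock.
def peak_gflops(fp_units, flops_per_unit_per_cycle, clock_ghz):
    return fp_units * flops_per_unit_per_cycle * clock_ghz

# Hypothetical chip: 2 FMA pipes, 2 FLOPs (mul+add) each, 3.0 GHz.
peak = peak_gflops(2, 2, 3.0)
print(peak)                # 12.0 GFLOP/s

# "Sustained" is just whatever you actually measure on some workload.
measured = 7.8             # e.g. from a dgemm-style benchmark run
print(measured / peak)     # 0.65 -> 65% of peak on that workload
```

The point is that "peak" is pure arithmetic on the datasheet, while "sustained" is tied to one specific workload and means nothing without naming it.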

For everything else, then you have a pile of benchmarks to use to figure out which one most resembles what you'll be doing. :)
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Anarchist420 said:
What are some CPUs with floating point units (any released to date), if any, that had sustained performance of 90% or more of their peak performance?

Wasn't the SH-4 used in the Dreamcast rather low (~64%)?

Why do console architectures tend to have a larger gap between sustained and peak performance? Is it to challenge the programmers more?

There isn't some universal sustained FLOP/s number you can measure. You have a maximum possible FLOP/s count, and then you have the FLOP/s you measure for some particular workload. You can try to compare things with a standard benchmark, but without knowing where the number you gave came from, it's totally meaningless.

There are a lot of reasons why getting anywhere close to peak FLOP/s can be difficult:

1) You need to do operations other than floating point ones, like loads/stores, integer ops, and control flow, and you have a limited amount of fetch/decode bandwidth to do this stuff in parallel with FLOPs.
2) The operations you need aren't a good fit for the FLOP execution capability of the device. This is especially true on the Dreamcast, where the majority of the peak floating point execution capability lay in the dot product/matrix multiplication instructions. So if you just wanted simple FADD or FMUL (or more complex instructions), the rest of the FLOPs would go under-utilized.
3) The operations aren't independent enough, meaning you end up stalling while waiting for them to complete. This is more likely on the Dreamcast's CPU, which is 100% in-order. It doesn't help that the compilers used also sucked.
4) You waste a lot of cycles on cache misses. The Dreamcast doesn't have a very good cache hierarchy, and it has no reordering capability that would let it execute FLOPs while waiting on cache.
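Point 3 can be made concrete with a toy latency model. The numbers here are hypothetical (3-cycle FP latency, 1 issue per cycle), not actual SH-4 figures:

```python
def cycles_dependent(n_ops, latency):
    # Each op consumes the previous result, so an in-order pipe
    # stalls for the full latency every single time.
    return n_ops * latency

def cycles_independent(n_ops, latency):
    # Fully pipelined: issue one op per cycle, then drain the pipe.
    return (n_ops - 1) + latency

dep = cycles_dependent(100, 3)    # 300 cycles
ind = cycles_independent(100, 3)  # 102 cycles
print(dep / ind)                  # dependent chain is ~2.9x slower
```

Same instruction count, same FP unit, but the dependent chain achieves roughly a third of the throughput, purely because the in-order pipe has nothing else to issue while it waits.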

Anarchist420 said:
The SH-4 could've and should've used much lower latency cache, been clocked at 400 MHz, had a 256-bit data bus at 125 MHz, and had 32 MB of system RAM to work with. I know that many Dreamcasts had trouble with overheating, but Sega was way too conservative with the clock speeds. It was still more efficient than the PS2, though, and pretty much traded blows with it (other than that the Dreamcast couldn't do more than a 16-bit color buffer, but the dithering artifacts were really only apparent in progressive scan).

Nothing's wrong with the cache's latency; I think maybe you mean main RAM latency. But it was using SDRAM like everyone else at the time. When you say 200 MHz was too conservative, I really don't know where that's coming from other than a vague desire for it to have been higher. How do you know that the first-generation CPUs could have really clocked higher nearly 100% of the time and still stayed within the thermal restrictions of the design? You yourself say they overheated; how could the clocking be conservative if they already couldn't handle the power consumption?

Anarchist420 said:
The PS2 was really weak because it had no built-in texture hardware and vector units 0 and 1 were clocked lower than the rest of the CPU. On the upside, the PS2 has been emulated much more easily than the Dreamcast has been.

Of course the PS2 had texture hardware. I don't know where this stuff comes from. And Dreamcast emulation reached a higher point much earlier than PS2 emulation and still has lower system requirements today, so I don't know where that's coming from either. The Dreamcast is a simpler system than the PS2; this is entirely what you'd expect.

inf64 said:
I was always under the impression that games (especially older ones from the DC and PS2 era) were very integer-heavy and didn't really need extremely potent FP units. Even today most games push the integer side of the code, stressing the branch prediction units (while load/store ops make up basically half of the executed ops in the code).

Well, for the Dreamcast, FPU performance was pretty important since that's where all the geometry transformation & lighting was being done.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Anarchist420 said:
That's right, it could do 8-bit palettized textures. I forgot about that.

You think it could only do paletted textures? Even the PS1 could do 16-bit textures. Look for the "GS User's Manual" and the section "Texture Mapping."
 

zephyrprime

Diamond Member
Feb 18, 2001
7,512
2
81
Don't think that the relationship between peak and sustained as it pertains to FLOP performance is the same as it is in other domains. For example, when talking about power output, it's not uncommon for the peak power in watts of an amp or power supply to be twice the sustained level. During the peak, the power device is actually working harder, but it is limited because it is draining whatever internal energy buffers it has. If it were to run continuously at peak power, it would burn out (but it can't run continuously at peak power anyway).

What I described above is TOTALLY NOT what "peak" GFLOPs means for a processor. During "peak" GFLOPs, the processor is not running any faster at all; it is running at exactly the same speed as before. What allows for more FLOPs is that the workload got easier. So if you send the FP unit a stream of trivially independent FP operations, it will hit peak FLOPs, but only because the workload is so easy. Send in a bunch of arithmetic operations with dependencies and it will be slower. And send in a real workload with branches and int operations and there will be pipeline stalls. That is why peak FLOP numbers are nearly useless. Pure marketing BS.
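The "workload got easier" effect can be sketched with a toy issue-width model. The parameters (2-wide issue, 2 FP units, the 30% FP mix) are invented for illustration, not measurements of any real CPU:

```python
def fp_utilization(fp_fraction, issue_width, fp_units):
    # FLOP instructions retired per cycle, capped by how many FP
    # instructions the front end can actually supply each cycle.
    fp_per_cycle = min(fp_fraction * issue_width, fp_units)
    return fp_per_cycle / fp_units   # fraction of peak achieved

print(fp_utilization(1.0, 2, 2))   # pure FP stream -> 1.0, i.e. peak
print(fp_utilization(0.3, 2, 2))   # realistic mix  -> 0.3 of peak
```

Nothing about the chip changed between the two calls; only the instruction mix did, which is exactly why a peak number tells you so little about real code.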

The reason why gaming consoles in particular have such a big discrepancy between peak and sustained is that their chips are cheap designs with many corners cut, which lowers real performance.