Lost Planet 2 DX11 - noticeable tessellation

Page 4 - AnandTech Forums

Scali

Banned
Dec 3, 2004
2,495
1
0
Programs do need to be HT optimized, and from the table below, it's clear that HT is working, but it only gives a mere 10% more frames than without it (Core i7 860 vs Core i5 760).

This statement makes no sense to me.
When doing *nothing*, you get 10% extra performance *for free* just because HT is enabled...
So obviously programs do *not* need to be HT optimized.
They *could* be optimized, because there *might* be more than 10% gain from HT if programmed carefully...

But it doesn't seem to make a lot of sense to *avoid* HT by running only 4 threads on an 8-logical-core machine, judging from this 10% of free performance.

Aside from that, this has nothing to do with what you said earlier, and my response to that (on how physical cores handle logical cores/thread workload). Do you understand what I said?
 

Keysplayr

Elite Member
Jan 16, 2003
21,219
56
91
Ok, attached the 24" and ran at 1920x1080; all settings are the same as my 16x10 test.

1680x1050 & 1920x1080, Full Screen, Refresh: 59.88, VSync OFF, MSAA4x, Motion Blur: ON, Shadow Detail: High, Texture Detail: High, Rendering Level: High, DX11 Features: High.

Test A: 16x10
Scene1: 58.5, Scene2: 54.1, Scene3: 67.4, Overall Avg: 58.1fps RANK "B"
Test A: 19x10
Scene1: 53.3 , Scene2: 48.7, Scene3: 61.0, Overall Avg: 53fps RANK "B"

Test B: 16x10
Scene1: 49.2, Overall Average: 47.2fps RANK "B"
Test B: 19x10
Scene1: 45, Overall Average: 43.4fps RANK "B"

Single GTX480 @ stock. 258.96 drivers
i7 860 @ 3.4GHz
Win7 64.
8GB DDR3

So, on average, going from 1680x1050 to 1920x1080 I lost 5fps across the board.
I hope this is enough to end this dispute, unless you guys will now demand 25x16 numbers from a single 480. (FYI: not going out to buy a 30" :D )
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
My exact words were

only one thread is being executed in one cycle in a physical core


At any given time only one instruction (of a given thread) is being executed in the core, no matter how many instructions of different threads are fed to the pipeline (fetch-decode-EXECUTE-store). Yes, you can have an instruction of thread A at the fetch stage, an instruction of thread B at the decode stage, another instruction of thread B at the execute stage, and so on, but only one instruction of a given thread is being executed at a time. That's why you don't want a stall in the pipeline, so you don't miss CPU cycles: you always want a fully fed pipeline in order to execute one instruction per cycle.

Because the latest x86 processors are superscalar and OoO (out of order), as you said, with deep pipelines, a thread could stall the pipeline (a misprediction), or one thread may not completely fill the entire pipeline, and that's where HT comes to save the day and keeps the pipeline fully fed with a second thread.


Scali said:
When doing *nothing*, you get 10% extra performance *for free* just because HT is enabled...
So obviously programs do *not* need to be HT optimized.
They *could* be optimized, because there *might* be more than 10% gain from HT if programmed carefully...

That’s what I wanted to say,

Sorry if my English is not good enough to better explain ;)
 

Scali

Banned
Dec 3, 2004
2,495
1
0
My exact words were




At any given time only one instruction (of a given thread) is being executed in the core, no matter how many instructions of different threads are fed to the pipeline (fetch-decode-EXECUTE-store). Yes, you can have an instruction of thread A at the fetch stage, an instruction of thread B at the decode stage, another instruction of thread B at the execute stage, and so on, but only one instruction of a given thread is being executed at a time. That's why you don't want a stall in the pipeline, so you don't miss CPU cycles: you always want a fully fed pipeline in order to execute one instruction per cycle.

This is not correct.
A superscalar architecture means that there are multiple execution units working in parallel, so you can execute more than one instruction per cycle in parallel.
Current x86 CPUs have a theoretical maximum of retiring 6 instructions in 1 cycle, but on average you'll see them processing 2 to 3 instructions per cycle.

So it's not 'the pipeline'; the key to a superscalar architecture is that you have multiple pipelines. Since the Pentium Pro, Intel has called them 'execution ports', as I said.

Basically what you're saying is from the 486 age. The Pentium was the first superscalar x86, and it processed 2 instructions in parallel from the same thread (it had two decoders and two execute pipelines, the U and V pipeline).
Pentium Pro expanded on that further, being able to handle 3 instructions in some cases (three decoders and an out-of-order pipeline with 5 execution ports)... and in a nutshell, Pentium 4 added HT to that, which means that those 2-3 instructions that can be executed at the same time no longer have to come from the same thread... which improves the efficiency of the superscalar architecture, as instructions from different threads are independent of each other's inputs and outputs by definition.
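As a rough sketch of the superscalar part of this: here's a toy Python model where a core may start up to 3 data-ready instructions per cycle. The 3-wide issue width and the dependency-only scheduling are assumptions for illustration; a real core also models execution ports, latencies, and register renaming.

```python
# Toy superscalar model: each cycle, issue up to ISSUE_WIDTH instructions
# whose inputs (dependencies) have already been computed.
ISSUE_WIDTH = 3  # assumed issue width, for illustration only

def cycles_to_execute(deps):
    """deps[i] = list of earlier instruction indices that i depends on."""
    done = set()
    cycles = 0
    while len(done) < len(deps):
        ready = [i for i in range(len(deps))
                 if i not in done and all(d in done for d in deps[i])]
        done.update(ready[:ISSUE_WIDTH])  # start up to 3 per cycle
        cycles += 1
    return cycles

# A chain where every instruction needs the previous result: no ILP.
chain = [[i - 1] if i > 0 else [] for i in range(12)]
# Twelve fully independent instructions: maximum ILP.
independent = [[] for _ in range(12)]

print(cycles_to_execute(chain))        # 12 cycles -> IPC 1.0
print(cycles_to_execute(independent))  # 4 cycles  -> IPC 3.0
```

Same twelve instructions in both cases; only the dependencies differ, and that alone moves the toy core from 1 to 3 instructions per cycle.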
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Scali said:
being able to handle 3 instructions in some cases (three decoders and an out-of-order pipeline with 5 execution ports)... and in a nutshell, Pentium 4 added HT to that, which means that those 2-3 instructions that can be executed at the same time no longer have to come from the same thread... which improves the efficiency of the superscalar architecture, as instructions from different threads are independent of each other's inputs and outputs by definition.



The principle is the same for each execution pipeline within the execution unit in each core of the processor. In the P4 we could have a stall in the pipeline of one of the execution units, and by incorporating HT it could feed another thread's instructions to that stalled pipeline in order to improve IPC ;)
 

Scali

Banned
Dec 3, 2004
2,495
1
0
The principle is the same for each execution pipeline within the execution unit in each core of the processor. In the P4 we could have a stall in the pipeline of one of the execution units, and by incorporating HT it could feed another thread's instructions to that stalled pipeline in order to improve IPC ;)

You said: "At any given time only one instruction (of a given thread) is being executed in the core, no matter how many instructions of different threads are fed to the pipeline (fetch-decode-EXECUTE-store)."
Which is wrong. You can have one instruction executed in every *execution port* of the core.
Since the P4 is an out-of-order architecture, it already accounts for stalled (or, well, just busy) execution ports in some way. The problem that remains is that with a single thread, you may have a lot of instructions dependent on the results of previous instructions.
This is what a stall means: execution units are sitting idle because there are no instructions that can be fed to them. There are generally far more execution units available than instructions ready for execution.

By feeding instructions from two threads at the same time, the reordering logic has a lot more independent instructions to choose from, and can keep the execution units more busy.

In other words:
At any time there will be one or more instructions being executed by the CPU. These instructions can come from either thread.
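To sketch that point: here's a toy Python model (an assumed 3-wide core; not a real scheduler) where each thread is a pure dependency chain, so a thread can only issue one instruction per cycle on its own, but two threads together fill more issue slots.

```python
# Toy SMT sketch on an assumed 3-wide core. A single dependent chain
# issues one instruction per cycle, leaving slots idle; a second
# thread's instructions are independent of the first's by definition,
# so they fill the otherwise-idle slots.
ISSUE_WIDTH = 3

def smt_cycles(chains):
    """chains: per-thread lengths of fully dependent instruction chains.
    Each cycle, issue at most one instruction per chain (the next link)
    and at most ISSUE_WIDTH in total."""
    remaining = list(chains)
    n_cycles = 0
    while any(remaining):
        slots = ISSUE_WIDTH
        for t, left in enumerate(remaining):
            if left and slots:
                remaining[t] -= 1  # this chain's next instruction issues
                slots -= 1
        n_cycles += 1
    return n_cycles

# Same total work (24 instructions) in both cases:
print(smt_cycles([24]))      # one thread:  24 cycles, IPC 1.0
print(smt_cycles([12, 12]))  # two threads: 12 cycles, IPC 2.0
```

The toy numbers are of course an upper bound; in a real core the two threads also share fetch, decode, and cache resources.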
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
I believe we're saying the same things with different words. I was trying to simplify the process to one pipeline (execution port). Anyway, good dialogue ;)
 

Scali

Banned
Dec 3, 2004
2,495
1
0
I believe we're saying the same things with different words. I was trying to simplify the process to one pipeline (execution port). Anyway, good dialogue ;)

I don't think we're saying the same thing, because you cannot simplify the process to one execution port...
You no longer have a 'pipeline' as such.
You just have (pipelined) execution ports. Different execution ports perform different operations. Together they form a 'virtual pipeline' that handles all instructions.
But the whole point of these execution ports is that the pipeline no longer stalls as a whole, so independent instructions can be executed out-of-order (at least until the independent instructions run out).
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Scali said:
But the whole point of these execution ports is that the pipeline no longer stalls as a whole, so independent instructions can be executed out-of-order (at least until the independent instructions run out).

Well, if you have a misprediction (OoO), the whole pipeline could be stalled, emptied, and fed again for a new cycle.


Taken from Programming with Hyper-Threading Technology

The NetBurst architecture is particularly adept at spotting sequences of instructions that it can execute out of original program order, that is, ahead of time. These sequences are characterized by:

■ having no dependency on other instructions;
■ not causing side effects that affect the execution of other instructions (such as modifying a global state).

When the processor spots these sequences, it executes the instructions and stores the results. The processor cannot fully retire these instructions because it must verify that assumptions made during their speculative execution are correct. To do this, the assumed instruction path and context are compared with the correct path instruction path. If the speculation was indeed correct, then instructions are retired (in program order). However, if the assumptions are wrong, a lot of things can happen. In a particularly bad case, called a full stall, all instructions in flight are terminated and retired in careful sequence, all the pre-executed code is thrown out, and the pipeline is cleared and restarted at the point of incorrect speculation—this time with the correct path.
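The cost of that full stall can be put into a rough back-of-the-envelope Python calculation. All the numbers here are assumptions for illustration (a ~20-cycle flush penalty, one branch per five instructions, a 5% mispredict rate), not NetBurst measurements.

```python
# Toy model: charge each mispredicted branch a full pipeline flush,
# and see what it does to average IPC.
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, flush_penalty):
    """Average IPC once mispredict flush cycles are charged."""
    instructions = 1_000_000
    base_cycles = instructions / base_ipc
    mispredicts = instructions * branch_fraction * mispredict_rate
    return instructions / (base_cycles + mispredicts * flush_penalty)

# Assumed: base IPC 3.0, 1 branch per 5 instructions, 5% mispredicted,
# 20-cycle flush. Effective IPC drops to about 1.88.
print(effective_ipc(3.0, 0.2, 0.05, 20))
```

Even with these modest assumed numbers, flushes eat a large chunk of the theoretical throughput, which is part of what a second HT thread can win back.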
 

Scali

Banned
Dec 3, 2004
2,495
1
0
Well, if you have a misprediction (OoO), the whole pipeline could be stalled, emptied, and fed again for a new cycle.

Yes, that's just one of those cases where your independent instructions run out (all instructions have to wait until the mispredicted instruction has executed properly).

But clearly HT does a LOT more than just benefiting from mispredictions... Especially considering how high a success rate the prediction algorithms in a modern CPU have.

These two points are important:
■ having no dependency on other instructions;
■ not causing side effects that affect the execution of other instructions (such as modifying a global state).

Instructions taken from two threads are independent and cause no side effects *by definition*. So having a second thread as an instruction source will greatly improve your OoO efficiency.
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
This statement makes no sense to me.
When doing *nothing*, you get 10% extra performance *for free* just because HT is enabled...
So obviously programs do *not* need to be HT optimized.
They *could* be optimized, because there *might* be more than 10% gain from HT if programmed carefully...

But it doesn't seem to make a lot of sense to *avoid* HT by running only 4 threads on an 8-logical-core machine, judging from this 10% of free performance.

Aside from that, this has nothing to do with what you said earlier, and my response to that (on how physical cores handle logical cores/thread workload). Do you understand what I said?

But it isn't like you will always get the same 10 percent boost in performance across all scenarios. There are times where Hyper-Threading can give you more than a 30% boost in performance, and some scenarios where HT will not do anything at all, especially in very dynamic, branchy code with dependencies, AFAIK.
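A toy Python model of why the gain varies so much (the 3-wide core and the per-thread ILP numbers are made-up assumptions, purely for illustration): if one thread already fills every issue slot, a second thread adds nothing, but if each thread is a dependent chain, the second thread doubles throughput.

```python
# Toy model: threads are (instruction_count, max_issue_per_cycle) pairs,
# where max_issue_per_cycle stands in for how much ILP the thread has.
ISSUE_WIDTH = 3  # assumed core width

def model_cycles(threads):
    remaining = [n for n, _ in threads]
    ilp = [w for _, w in threads]
    total = 0
    while any(remaining):
        slots = ISSUE_WIDTH
        for t in range(len(remaining)):
            issue = min(remaining[t], ilp[t], slots)
            remaining[t] -= issue
            slots -= issue
        total += 1
    return total

# High-ILP thread already saturates the core: HT gains ~0%.
print(model_cycles([(300, 3)]), model_cycles([(150, 3), (150, 3)]))  # 100 100
# Dependent-chain threads (1 issue/cycle each): HT doubles throughput.
print(model_cycles([(300, 1)]), model_cycles([(150, 1), (150, 1)]))  # 300 150
```

So in this sketch the same HT hardware swings between no benefit and a 2x gain depending purely on how much ILP the single-threaded code already exposes.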