AMD's HSA .. when the same on CPUs?

DigDog · Mar 9, 2014

Finally AMD has given us the ability, at consumer level, to use multiple hardware parts of different computational power to share a task ..

imho, HSA is the best thing to come in years. A milestone, almost.

(not really a big fan of AMD or ATI, me, but credit is due)

Heterogeneous System Architecture allows a PC to use both onboard and discreete GPUs to work in crossfire, sharing the workload despite being very different in abilities.

Now you might say, why is this in CPUs & OC??

Well the big if is, why aren't we getting the same in CPUs?

The question i've had since the very first dual-core cpus emerged, is why is it that we need software coded for multithreading, instead of giving the task of threading an app to the hardware?

Imagine a motherboard with, 2, 4, 6, any number of sockets. Even mobos with riser boards dedicated to CPUs only.
Now imagine a hardware controller, a secondary cpu, whose task is solely to divert the workload to the cores, and then put it back together.

At the moment, threads don't work like this. But if one could separate bits, or parts of an instruction, and then have multiple CPUs process this, then we wouldn't have to strive for smaller nodes, or faster clocks.

a nice 32 core CPU cluster, with attached 4 core controller, could be revolutionary.

Now i ready my box of tissues cuz you are your reality checks are gonna make me cry/

monstercameron · Mar 9, 2014

there is no automatic parallelism involved with opencl, having off-die communication undermines HSA and you can use amds opencl stack on any x86 cpu.

jhu · Mar 9, 2014

HSA doesn't free the programmer from figuring out how to make his code parallel.

Phynaz · Mar 9, 2014

Look up project Mitosis.

NTMBK · Mar 9, 2014

Automatically extracting parallelism from serial code is staggeringly difficult. Even at compile time it is very hard to get any level of parallelization for free. Doing it at runtime is nigh on impossible.

I think you misunderstand what HSA is. It provides a common IL which can be JIT compiled to CPU or GPU instructions, and a runtime which distributes parallel work across both, but the software still needs to be written using a model which expresses parallelism. It doesn't give the developer a free lunch.

piesquared · Mar 9, 2014

Phynaz said:
Look up Mitosis.

Sick. No thanks, look up Yourowntosis.

HSA gives developers the opportunity to use all accelerators within the APU, and provides tools to make them as easy to program for as the serial block.

lefty2 · Mar 9, 2014

HSA is good, but we need OpenCL 2.0 drivers before developers can start to use it and they won't arrive for a few more months.

BrightCandle · Mar 9, 2014

I don't think you understand how this stuff is programmed at all. HSA is even harder to program for than current multithreaded CPUs are, dramatically harder. There are severe restrictions on the sorts of things that can be run on a GPU with HSA, a far smaller set of problems than a multithreaded CPU can do. Seeing as how multithreading hasn't really taken off in a big way you can guarantee we'll see even less software with HSA.

There is no magic bullet to solving the performance problem on CPUs, and I am certain the answer is not HSA in any of its current forms.

DigDog · Mar 9, 2014

Phynaz said:
Look up project Mitosis.

link plx

Anyway; i was just using HSA as an example. My argument is, why haven't we looked at hardware solutions to divide a workload? Can it not be done? Imagine writing code for a single thread only, but a hardware controller divides it between multiple cores. Surely this is more efficient than increasing clocks? We're down to 20nm, we're not getting much smaller.

Headfoot · Mar 9, 2014

Doesn't work even remotely like that. http://en.wikipedia.org/wiki/Amdahl's_law, at least the way you're thinking about it.

Parallelism in the multi-core/multi-threaded sense is completely architectural. You have to approach the same problem differently; different algorithms, different mindsets. Very very different. You need better code insight in order to determine which parts of your code bottleneck you, then you have to determine if that part can be parallelized. Some things simply can not be parallelized. Even if it can be, the effort and time involved requires a much higher level of programming skill and multi-threaded code gets increasingly difficult to debug, meaning it takes increasingly long to write, meaning it costs increasingly more money to code it. Making code more parallel has been one of the top objectives of highly intelligent computer scientists and coders for a few decades now, at least since the mid-to-late 90s for sure.

The "hardware solution" to divide code already exists as instruction level parallelism techniques at the chip level. It is naive to think no one has looked at hardware methods to multithread. Of course they have. They've been looking at making code more parallel ever since the first network was developed decades ago. There are plenty of ways to make a problem more parallel that translate to the single socket level, most of them are now common "server" loads since they distribute well. Webserving, for example. You add hardware/software load balancers at the front end and it sends page requests to as many servers as you configure it to. That is hardware workload division.

BrightCandle · Mar 9, 2014

Even the most simple of problems don't multi thread well. Take for example summing up a list of numbers. All I need to do is add them all up to determine the total count.

The most efficient way to do this for say 16 numbers would be to first add up 8 pairs of numbers on 8 CPUs. Then what happens is that I now have 8 sums which I can add up on 4 CPUs. Now I have 4 sums that I can pair up and add on 2 CPUs. Then finally I can add the two sums up to get the final sum. This produces a tree where the maximum number of processors that can be used is the length of list divided by 2. But it will take O(log(n)), ie the height of the tree executions to get the total summation. Its a better algorithm that just adding them all up individually on a single CPU, that would take 16 operations instead of the 4 from above but it also uses a lot less CPUs and has less interaction between the CPUs, which in this case would make the parallel solution in practice slower.

Its just not simple to make parallel software, even stupidly simple things can't be multithreaded properly. I kind of class algorithms in to 3 buckets:
1) Embarassingly parallel - do the same operation on lots of individual data packets and not interaction between them. Multithreads extremely well and easily.
2) Tree parallel - Like the summation above the algorithm naturally falls into a tree and thus utilises cores poorly but does gain for multithreading at least a decent amount.
3) Possible (1) or (2) - These tend to code that is vast and complicated but we believe its likely got a parallel solution its just we can't tell because there is too much code and its not written in a way that makes it easy to try parallel solutions. Very expensive to rewrite to test.
4) Single threaded - Plenty of operations require single threads, have no known parallel solution whatsoever and have been proven to not have one at all.

There are many many classifications of algorithms for parallel behaviour across different architectures and such, but the reality is common algorithms we use today regularly fall into complicated situations where they can't be run in parallel or with multithreads for a variety of reasons. Amdahl's law is a simple way to look at it but its much more complicated because we have different classifications of how parallel an algorithm can be.

GPUs and game rendering is actually a great example of something that today scales really well. We have upwards of 4 million pixels on the screen, each of which can be individually determined. We can go get the world data and find the point we are looking at, find the texture, run a series of pixel shaders on it and get a final result. That works great when we have 3000 processors on a GPU, what however happens in 10 years when we have 3 million? Well then its not going to scale anywhere near as well, the best case of scaling here is 1 processor on a GPU per pixel, and the limit of human sight is in and around 12k resolution. At some point even that problem will run out of scale.

Its just more complicated than most people realise. The software we have for writing multithreaded software is beyond primitive. The entire concept of programming today is in serial operation, threads are a tack on. We have almost no tools for finding bugs, limited tools for even dealing with type (3) algorithms above and almost no practical way to make an algorithm behave well on a variety of different core CPUs and architectures. We are decades away from doing this as best as possible, and even then we will find a lot of algorithms that just can't be scaled because mathematics says they can't be.

NostaSeronx · Mar 9, 2014

http://www.anandtech.com/show/1766/7

^Project Mitosis, for those wanting a link.

NTMBK · Mar 9, 2014

BrightCandle said:
Even the most simple of problems don't multi thread well. Take for example summing up a list of numbers. All I need to do is add them all up to determine the total count.

The most efficient way to do this for say 16 numbers would be to first add up 8 pairs of numbers on 8 CPUs. Then what happens is that I now have 8 sums which I can add up on 4 CPUs. Now I have 4 sums that I can pair up and add on 2 CPUs. Then finally I can add the two sums up to get the final sum. This produces a tree where the maximum number of processors that can be used is the length of list divided by 2. But it will take O(log(n)), ie the height of the tree executions to get the total summation. Its a better algorithm that just adding them all up individually on a single CPU, that would take 16 operations instead of the 4 from above but it also uses a lot less CPUs and has less interaction between the CPUs, which in this case would make the parallel solution in practice slower.

But scale the dataset up and it suddenly becomes very amenable to parallelisation. If you're summing 65536 numbers, split it into 8 subsets of 8192 and sum one of those subsets on each processor, and then collapse the sum at the end. The "collapsing" part suddenly becomes pretty trivial compared to the per core workload. Bad example.

But yes, there are plenty of algorithms which due to the underlying mathematics don't scale well with core count.

NTMBK · Mar 9, 2014

The issue with Project Mitosis and other speculative multithreading techniques is that they waste energy, driving up power consumption by firing up 4 threads for a single threaded problem. That kind of thermal abuse is just not acceptable in mobile platforms as it kills battery life, and it is not acceptable in large scale servers which tend to have heavily multithreaded workloads as you will be burning CPU resources that another thread could have put to good use. There's a reason why almost 10 years later Intel hasn't said a word about putting it into a product.

DigDog · Mar 9, 2014

perhaps then the approach is to find a new language which allows for easier multithreading.

Phynaz · Mar 9, 2014

DigDog said:
perhaps then the approach is to find a new language which allows for easier multithreading.

It's not the language. Multithreading is hard to do.

You can also research Itanium and EPIC.

jhu · Mar 9, 2014

DigDog said:
perhaps then the approach is to find a new language which allows for easier multithreading.

The issue is that not all problems can be made parallel. For example:

a = b + c
d = a + e

There is no way to find out what "d" is prior to calculating "a". No language is going to ever going to give that type of prescience.

NostaSeronx · Mar 10, 2014

jhu said:
The issue is that not all problems can be made parallel. For example:

a = b + c
d = a + e

There is no way to find out what "d" is prior to calculating "a". No language is going to ever going to give that type of prescience.

That problem can be made to be temporally parallel. You could have two equal units solve it and in another way you could make it 4-op. Both versions can solve the problem within 1 cycle. Rather, than having it operate on "a" in one cycle then on "d" on the next cycle.

monstercameron · Mar 10, 2014

NostaSeronx said:
That problem can be made to be temporally parallel. You could have two equal units solve it and in another way you could make it 4-op. Both versions can solve the problem within 1 cycle. Rather, than having it operate on "a" in one cycle then on "d" on the next cycle.

isnt that point being made? that there is a depency issue, the processor would have to calculate for "a" before calculating "d".

forgive my ignorance but isnt something like this what sse was made for?

d = b + c + e [ http://software.intel.com/en-us/articles/implement-a-horizontal-addsubtract-with-sse3-instructions ?]

or has the simplicity of the example fooled me into thinking it is that easy?

norseamd · Mar 10, 2014

with all the pc game developers and publishers complaining about multithreading i thought it must actually be easy

seems multithreading is a lot harder than you would think

Headfoot · Mar 10, 2014

DigDog said:
perhaps then the approach is to find a new language which allows for easier multithreading.

This is actually much closer to the reality out there. There are indeed languages which by nature are easier to multithread. Functional programming languages for example are generally easier to multithread, due to the fundamental principle of data immutability. Some languages function solely through message passing, where each piece of code exists on its own and independently of other pieces of code so that when these pieces are run they can be run in parallel. The trick is to make each piece of code as small and single function as possible in these languages. The language Erlang springs to mind. It's quite interesting. The trade-off is that this kind of programming is more difficult for the average programmer, and its less common so there are fewer libraries out there to provide pre-baked plugins

Let me explain a very common multi-threading problem: two (or more) threads are operating simultaneously and need to read then modify the same piece of data. The old school/simple solution is a semaphore or lock. One thread locks the piece of data against all other threads until it is done using it. The problem here is that causes data contention that only gets worse as more threads are added. If you don't lock the data, one thread may modify the piece of data and screw up the other thread, or it might modify the data and the other thread re-writes over that piece of data with a different result so that your program behaves erratically.

In functional programming, you can not modify a piece of data like this. All data is essentially "locked" in that it is immutable. Instead, what you do is copy the piece of data and modify the copy, and continually copy and modify the copy. It is a little more complicated than that but this description gets the gist of it. So, multiple threads can all access this same piece of data which will remain constant and return with different end-results. This is easier to debug and more consistent. This technique can be applied in non-functional languages as well, but then it is a matter of coding discipline and not language design.

What previous posters have said is still true though, certain code paths will have dependencies requiring that some other code is executed before the current code can run making the problem serial. Sometimes you can write around this by changing your approach or your algorithm and sometimes you cant, or its not feasible to change it.

NTMBK · Mar 10, 2014

monstercameron said:
forgive my ignorance but isnt something like this what sse was made for?

d = b + c + e [ http://software.intel.com/en-us/articles/implement-a-horizontal-addsubtract-with-sse3-instructions ?]

or has the simplicity of the example fooled me into thinking it is that easy?

No, HADDPS works on multiple pairs of numbers. Details are here: http://msdn.microsoft.com/en-us/library/yd9wecaa(v=vs.90).aspx

You can then shuffle and HADD again, but you still have the same basic dependency chain issues. Just instead of going parallel over multiple threads you're going parallel over multiple SIMD lanes.

There is the special case of

A = (B*C) + D

which often has a specialised hardware instruction to calculate it in one, as opposed to two instructions- this is called Fused Multiply Add, or FMA. GPUs have had it since forever, but CPUs only just got it in Bulldozer and Haswell. http://software.intel.com/sites/pro...r_c/intref_cls/common/intref_avx_fmadd_ps.htm

NTMBK · Mar 10, 2014

Headfoot said:
Let me explain a very common multi-threading problem: two (or more) threads are operating simultaneously and need to read then modify the same piece of data. The old school/simple solution is a semaphore or lock. One thread locks the piece of data against all other threads until it is done using it. The problem here is that causes data contention that only gets worse as more threads are added. If you don't lock the data, one thread may modify the piece of data and screw up the other thread, or it might modify the data and the other thread re-writes over that piece of data with a different result so that your program behaves erratically.

Don't forget about hardware lock elision, which makes the lock approach much faster. But I agree that it is still a sticking plaster for a programming model not designed to be multithreaded.

sefsefsefsef · Mar 10, 2014

One important example that shows that some computation cannot be performed in parallel is simple addition. You cannot perform addition on all the bits in an integer in parallel. You must first compute the result of adding the bits in the ones place, compute the carry out, then work on the twos place, its carry out, then the fours place and so on. It is inherently a sequential operation.

The other thing that I always like to bring up when someone mentions Amdahl's law is Gustafson's Law: http://en.wikipedia.org/wiki/Gustafson's_law

It's basically a more optimistic view of what Amdahl's law says (in fact, there's a paper out there that proves that they are mathematically the same thing). Amdahl's law is focused on limiting performance improvement, whereas Gustafson's law is about increasing the amount of data you can work on in parallel. This is the view we should all be having, and is exactly what goes on in the world of graphics. They have more parallelism, so they create scenes with more triangles and pixel shader programs.

The truth is, that more cores should be put to use. There is no excuse. App developers should find interesting uses for these extra compute resources. Mostly I'm talking about games, where it's obvious where the additional compute power could go (physics, tons of actors, whatever else).

But the basic idea of the OP is incorrect. This stuff cannot be done in hardware until the hardware can understand the semantics of the problem statement that the code solves for, and come up with an alternate, data-parallel solution. Basically, sci-fi stuff at this point. But even then, the hardware would be limited by Amdahl's law, and the hardware would definitely not be creative enough to throw more data at the problem and take advantage of Gustafson's law, which is IMO the best use for parallel resources.

jhu · Mar 10, 2014

NTMBK said:
.

There is the special case of

A = (B*C) + D

which often has a specialised hardware instruction to calculate it in one, as opposed to two instructions- this is called Fused Multiply Add, or FMA. GPUs have had it since forever, but CPUs only just got it in Bulldozer and Haswell. http://software.intel.com/sites/pro...r_c/intref_cls/common/intref_avx_fmadd_ps.htm

x86 perhaps. IBM POWER has had it since 1990.

AMD's HSA .. when the same on CPUs?

Lifer

Diamond Member

Lifer

Lifer

Lifer

Golden Member

Senior member

Diamond Member

Lifer

Diamond Member

Diamond Member

Diamond Member

Lifer

Lifer

Lifer

Lifer

Lifer

Diamond Member

Diamond Member

Lifer

Diamond Member

Lifer

Lifer

Senior member

Lifer