Haswell to support transactional memory in hardware

Nemesis 1 · Feb 10, 2012

Abwx said:
No , because i m not gullible and that i take the time to read articles,
as the one from Hardware.fr

Really . I won't take the time now . but tomorrow I will look threw old pre release BD topics see if we can come up with gullible material

Dresdenboy · Feb 10, 2012

blckgrffn said:
AMD, you can get this via IP sharing, no? Hopefully yes.

AMD is working on that for a while now:
AMD's ASF
Velox project
OSRC stuff
patents

Olikan · Feb 10, 2012

Dresdenboy said:
AMD is working on that for a while now:
AMD's ASF
Velox project
OSRC stuff
patents

now steamroller makes sence with it's "greater parallelism"

Cerb · Feb 10, 2012

Article (linked by Abwx) said:
transactional memory requires an intervention by the programmer (via intrinsic this may also be the case for math instruction) or a new memory model for the programming language (the model proposed above transactional memory for C + 11). If we add to that the unknown with concern the eventual implementation of AMD, how RTM may have an even harder to win than other extensions to the x86 instruction set.

Any handling of a shared memory space requires the programmer to deal with how to share it (including making it non-shared spaces, from the program's PoV). That intrinsics may be involved should just be an early adoption penalty.

With highly varied and competing STM implementations, offering long-term support concerns, performance concerns, unknown-in-advance learning curves, and somewhat unknown degrees of correctness (this being somewhat of a trust issue), the cautious wisely stay away, unless it perfectly fits their needs. If one common implementation can be coalesced upon (an Intel-compatible one, FI

), offering good performance across-the-board by hardware support for small writes, that can change. If all goes well, we could even see smaller entities, like ARM, MIPS, and Renesas, create compatible HW implementations.

Working with hardware instructions to get it done, as an application developer, should be a short-term problem, whether it gets supported better by languages directly, or implemented entirely through libraries.

Intel saying, "here is how we are going to do it," and that way not being extra proprietary/screwy, is pretty important. If say, IBM, ARM, and MIPS came out with such features, and treated them entirely differently, things would be a mess. With Intel doing it in a way that is fairly straight-forward, and not unlike IBM (the only other big company to have implemented it, TMK), there's a good chance they have already, before the chip's release, decided how everyone else is going to do it.

Nemesis 1 said:
Really . I won't take the time now . but tomorrow I will look threw old pre release BD topics see if we can come up with gullible material

Just wait. AMD will tout how awesome it is that they do it, when whatever chip that does is going to come out, and try to spin it like they're special, with cherry-picked benchmarks and colorful graphs, even though they just do the same thing Intel has already done. AMD has not been standing still on this, but as far as ISA implementation, they are stuck following Intel on a leash.

* not sure about Oracle, ATM

Dribble · Feb 10, 2012

Cerb said:
The greatest benefit is that it makes a lot more sense, when sharing data across threads. Let each thread read and write as it will, with checks to verify correctness, and then allow a globally-visible commit, or fall back. It would be wrong to say it is simple, but it would be right to say that it fits most humans' thought patterns far better than locking. Using locks in a traditional way is very much a choice of lesser evil over greater evil (lockless operation with shared memory--run away in fear!).

That's all there today in read/write locks. Any number of processes can grab a read lock to data and they can all access it concurrently, but there is only one write lock which can only be obtained when there are no read locks. That means concurrent multithreading works efficiently and locks only stop everything when they really need too.

Big difference from what I can tell with my limited knowledge is that with TM when someone commits back everyone else has to abort which means they all go back to square one like would have happened with locking only they wasted a lot of cpu cycles processing data that was then deemed out of date.

GammaLaser · Feb 10, 2012

Dribble said:
That's all there today in read/write locks. Any number of processes can grab a read lock to data and they can all access it concurrently, but there is only one write lock which can only be obtained when there are no read locks. That means concurrent multithreading works efficiently and locks only stop everything when they really need too.

Big difference from what I can tell with my limited knowledge is that with TM when someone commits back everyone else has to abort which means they all go back to square one like would have happened with locking only they wasted a lot of cpu cycles processing data that was then deemed out of date.

The big difference is that with TM the hardware is keeping track of conflicting memory accesses. You may have a read/write lock but if your writer happens to not be modifying any of the data your readers are interested in, then your readers are waiting for the lock unnecessarily. You could try to make a finer-grained lock but the idea is that with TM the programmer is relieved from the burden of figuring out the optimal locking. The hardware will be as optimistic as possible so that threads don't have to wait, and then provide a fallback when that optimism might create a functional failure.

BrightCandle · Feb 10, 2012

On heavily contended locks STM performs worse and consumes more power, but on lightly contended locks it performs better. Its not exactly the solution to parallelism problems though it'll just help in a few circumstances. Clojure should benefit quite a bit once the JVM gets support it can use (since its heavily dependent on STM and immutable data).

Phynaz · Feb 10, 2012

Abwx said:
No , because i m not gullible and that i take the time to read articles,
as the one from Hardware.fr

Third time you have referenced the same blog post.

Do you have anything new to add?

Cerb · Feb 10, 2012

Dribble said:
That's all there today in read/write locks.

No, there's locks

. That's basically it: getting rid of those, from the PoV of the application programmer, without going into the wild west of lock-free shared-everything. TANSTAAFL, but good TM could make the drinks half price.

Abwx · Feb 10, 2012

Phynaz said:
Third time you have referenced the same blog post.

Do you have anything new to add?

Third ??...
Either you are lacking in elementary calculus or you need
a new pair of glasses....

For the rest , all is said in the said article....

Tuna-Fish · Feb 10, 2012

Dribble said:
That's all there today in read/write locks. Any number of processes can grab a read lock to data and they can all access it concurrently, but there is only one write lock which can only be obtained when there are no read locks. That means concurrent multithreading works efficiently and locks only stop everything when they really need too.

This is a very simplistic view assuming fine-grained locking. In practice, it's never that easy. Taking locks (even read locks) has a cost. Taking locks that were previously acquired in another thread has a huge cost, even if they are free now. In reality, you have the option between coarse-grained locks (not much performance lost to synch primitives, but any writes stall everything), and fine-grained locks (you stall only when you need to, but you spend most of your time manipulating locks.)

Also, since locks need to be taken in order to avoid deadlocks, systems built with locks never compose. If you have a function that uses locks entirely correctly, you cannot take a new lock, call the function and trust that your program is correct without examining that function.

Synchronization shared data is not solved. Ask anyone who has ever implemented a large system using locks. Locks just don't scale -- neither to systems with a lot of threads, nor to complex systems with a lot of data.

Big difference from what I can tell with my limited knowledge is that with TM when someone commits back everyone else has to abort which means they all go back to square one like would have happened with locking only they wasted a lot of cpu cycles processing data that was then deemed out of date.

But only when there actually *is* a collision. With locks, you always pay most of the cost of synchronization. Most of the time there will be no collisions. With TM, you pay nothing when the transaction successfully commits, and pay the cost of rolling back only when it is absolutely necessary.

Tuna-Fish · Feb 10, 2012

Phynaz said:
Thanks for the link. I read through chapter 8.

You know what I'm thinking....This is a first step towards speculative execution. This is the first of the control mechanisms that will be required. I wonder now how much of this initially started with the Mitosis research project?

This is not the first step, this *is* speculative execution. xbegin marks the beginning of the transaction, and the cpu will speculatively execute starting there until the end of the transaction.

Phynaz · Feb 10, 2012

Awesome info guys. I don't have a programming background and some of the posts in this thread are very helpful!

Phynaz · Feb 10, 2012

Abwx said:
Third ??...
Either you are lacking in elementary calculus or you need
a new pair of glasses....

Put your money on both

Phynaz · Feb 10, 2012

Abwx said:
As a side note , i will add that the topic title is misleading

Then report it.

nyker96 · Feb 10, 2012

blckgrffn said:
Having seen a "variation" of this sort of thing introduced in file systems, which have relied on traditional locks like this in the past, the upsides and practical throughput gains can be pretty enormous. I would think the more cores you have the more important this technology is to maintain throughput.

I say that because this is important when you have multiple servers accessing a clustered files system, which is how I view cores competing for memory access. It probably isn't perfect or even 90% accurate, but it makes sense of this concept for me.

Cerb said:
The greatest benefit is that it makes a lot more sense, when sharing data across threads. Let each thread read and write as it will, with checks to verify correctness, and then allow a globally-visible commit, or fall back. It would be wrong to say it is simple, but it would be right to say that it fits most humans' thought patterns far better than locking. Using locks in a traditional way is very much a choice of lesser evil over greater evil (lockless operation with shared memory--run away in fear!).

Software transactional memory has a fair chance of imposing more overhead than it offers in added performance, due to the added overhead of managing transactions. Each transaction must be isolated until commit, must have a way to verify that it should commit, that it did or did not commit, and then something to do on commit failure. When the transactions themselves are for very small amounts of work and/or memory, keeping up with that can take longer than waiting to grab a lock. Meanwhile, optimistic locking gets you halfway into the problems of trying to go lock-free.

I agree with much that is said here by both of you. In theory it look very good. But as a software engineer, I must say, the topic of threading isn't an easy one. And from my own experience, in fact I have a piece of software right now that I'm testing for threading benefits. It's very hard to say because many times, you have to really look at the actual test data to see if the added overhead of threading can actually be overcame by the benefits of running on all cores. Sadly I have to say many software which may seem a prime candidate to threading doesn't benefit as much or get even worse performance after you threaded its core algorithm.

If this turns out to have major practical gains I will certainly use it in the code provided it's backward compatible on older cpus like they claim.

gorydetails · Feb 10, 2012

so stm makes it slower?????

Cerb · Feb 11, 2012

gorydetails said:
so stm makes it slower?????

No, it can be slower. On average, it should be faster for high-concurrency programming, and even when not, should make for easier for higher-level programming (C#, FI; though C++ may have wide support first).

For easy safe operation, you can use simple pessimistic locking. In this case, scalability sucks. Optimistic locking, and finer-grained locking, lead to emergent bugs all the time. Locks themselves may be simple, but 20+ locks, not all of which behave the same way, are not, and you still have to carefully analyze the global changes you make, to be sure you won't create races.

Alternatively, you can use shared-nothing, leaving you with no need to worry about locks, except for I/O. Even a more efficient shared-nothing system, doing CoW, is going to eat up time and bandwidth compared to much sharing. In addition, a task that looks parallel, but with overlapping accesses, often, especially w/ compositions, ends up having far worse performance than just serial code, because it's processed mostly serially, but with all the extra overhead of unshared threaded code. Ugh. Sometimes, you want shared memory, but you still want dangerous race protection.

With transactional memory, you can get rid of most locks (improving performance and making for easier to follow code), but you must have a way to keep (or create while writing) a clean 'before' state, detect a potential conflict (safe races may be detected as conflicting), and recover from a conflict. But, you get the performance advantages, on a single box with many concurrent memory operations to the same space, of sharing, in that you aren't wasting memory time and bandwidth making too many unnecessary copies, you aren't synchronizing (often flushing) memory just in case, you aren't blocking a bunch of safe code just in case; and while is plenty of room for compiler, run time system, and code bugs, there is far less room to create a mire of locks that all bug guarantees a hard to track down bug some day. It won't make a bad design work better, but can make a good design easier to implement and maintain.

The catch is that the whole data memory space of the application has to be able to be treated like a coherent database, and that's not free. Every single write to shared memory is effectively encapsulated in a check-try-catch-finally block. The performance cost of that complication, even in the common case of no detected conflicts, is part of why, to become more than a niche feature, it needs some hardware backing, to keep the overhead down.

With real STM implementations coming along (ICC, GCC), non-academic implementation research having mostly been successful (STM.NET and Velox, that I know of), pure software implementations having been fairly well tested (not always being speedy, though, and adding a complicated management layer to deal with), and the need becoming more obvious as we get into having more and more cores available, now is just the right time to get hardware support, to lessen the performance burden.

Idontcare · Feb 11, 2012

gorydetails said:
so stm makes it slower?????

As Cerb states, it is not about making is slower, it is about thread-scaling and the resultant performance improvement being closer to ideal (less hindered) versus having a bunch of inefficient locks going on in which the performance scaling is depressed.

Depending on how you handle the coding, you could end up with thread scaling that is like the red or green curves, memory contention with more and more locks that scale with thread count actually results in performance slowing down.

But you could avoid that with transactional memory, getting yourself to the blue or black scaling curve...but there is always opportunity to mess it up and take an application that otherwise would perform at the black level and turn it into an app that scales at the blue level.

But transactional memory should not degrade your scaling from that of the black line to that of the red line, whereas a coder or compiler can do that in an instant with the wrong combo of locks and so on.

Transactional memory makes it easier to not shoot yourself in the foot as a programmer. It is not necessary for your top-1% coders out there, but as a software company you have to deal with the reality of your workforce being composed of the other 99%, and transactional memory helps that 99% be who they are and not have that be a liability to their employer or the end-customer.

Tuna-Fish · Feb 11, 2012

Cerb said:
With real STM implementations coming along (ICC, GCC), non-academic implementation research having mostly been successful (STM.NET and Velox, that I know of), pure software implementations having been fairly well tested (not always being speedy, though, and adding a complicated management layer to deal with), and the need becoming more obvious as we get into having more and more cores available, now is just the right time to get hardware support, to lessen the performance burden.

Just a note: Clojure has probably the most widely used STM implementation out there, and it's generally well received.

Ajay · Feb 12, 2012

nyker96 said:
If this turns out to have major practical gains I will certainly use it in the code provided it's backward compatible on older cpus like they claim.

Well, the hardware instructions won't be backward compatible with older CPUs, but with older software.

I don't see this as a huge development for software engineering, since debugging highly threaded concurrent server engines over a muti-tiered distributed applications remains a b*tch. But, getting higher performance on the server side will be very valuable. The impact for the programmer may be a net zero, because if a write request fails, one may have to determine if the request is still even valid rather than simply reexecuting write hoping the hazard has passed. Database design will need to become even more intelligent wrt distributing data to reduce contention. With could computing on such a large scale, databases may have to restructured on the fly.

I haven't been involved with these sort of problems since 2008, so I don't know how back ends and databases are evolving to handle the new trends in computing.

PS. Interesting chart IDK, where is it from?

IlllI · Feb 12, 2012

will this allow me to play doom faster?

janas19 · Feb 12, 2012

IlllI said:
will this allow me to play doom faster?

... still lol'ing... rofl

exar333 · Feb 13, 2012

IlllI said:
will this allow me to play doom faster?

Yes, if you want to play 2000 concurrent Doom applications at the same time.

Phynaz · Feb 13, 2012

Ajay said:
Well, the hardware instructions won't be backward compatible with older CPUs, but with older software.

Actually yes, the instructions are backwards compatible. You can run this code today, on a SB CPU and it is just ignored and it falls back to the old locking method.

Haswell to support transactional memory in hardware

Lifer

Golden Member

Platinum Member

Elite Member

Platinum Member

Member

Diamond Member

Lifer

Elite Member

Lifer

Golden Member

Golden Member

Lifer

Lifer

Lifer

Diamond Member

Member

Elite Member

Elite Member

Golden Member

Lifer

Diamond Member

Platinum Member

Diamond Member

Lifer