PathScale Looks to One-Up CUDA and OpenCL with New GPU Compiler

cbn

Lifer
Mar 27, 2009
http://www.hpcwire.com/features/Pat...CL-with-New-GPU-Compiler-97089024.html?page=1

According to PathScale CTO Christopher Bergström, interest in doing a GPU compiler began shortly after the company rebooted last year. Since NVIDIA was leading the GPGPU charge, they started with the idea of targeting the Tesla GPU line. Hoping to reuse some of NVIDIA's CUDA stack, they quickly found that the code generator and driver were not optimized for performance computing. "Their drivers, which really dictate quite a bit of what you can do, are supporting everything from gaming to HPC," says Bergström. "It's not that they haven't built a good solution. It's just not focused enough for HPC."

Moreover, they found writing CUDA code for performance tedious, requiring a lot of programmer hand-holding to optimize performance. In particular, the PathScale engineers found that the register usage pattern in the CUDA compiler was generalized for all types of GPU cards, so performance opportunities for Tesla were simply missed.

The twist here is that GPU ISA is volatile -- at least more so than say a CPU. Fortunately, the instruction and register enhancements tend to be incremental. Bergström says they will support all the latest GPU cards being used for HPC, that is, essentially all the cards supported in the three generations of Tesla products. PathScale has a working pre-"Fermi" driver now and is working on the compiler port. "We just got access to the hardware last month," explains Bergström. "So we've basically had 30 days to start tackling the ISA and the registers." He predicts they'll have a fairly robust Fermi port within the next 60 to 90 days.

A Fermi HPC port? How *much* could this affect Nvidia's ability to convert lower-profit GeForce SKUs into high-profit Tesla SKUs?

Bergström is careful not to claim performance superiority over the CUDA technology just yet. He says ENZO is currently in the alpha or early beta stage. According to him, PathScale engineers have hand-tuned some code using GPU assembly, and have achieved a 15 to 30 percent (or better) performance boost. In other cases, they're not quite there and need to find the right optimizations. Bergström is confident that those hand-coded optimizations can be incorporated into the compiler infrastructure. They have identified a number of areas where they can reduce register pressure, hide latency, reduce stalls and improve instruction scheduling. "We know the performance is there," says Bergström.

A 15 to 30 percent performance boost doesn't sound like much if Nvidia always has the upper hand in steering the direction of the architecture.

But if they can make writing code less tedious (as mentioned in the article), could this help pave the way for innovation in applications? Lowered HPC programming costs plus the ability to use regular gaming cards sounds like a chance for higher risk-taking with projects.
 

cbn

Lifer
Mar 27, 2009
Page 2 of the article said:
The other part of the story is that CAPS, along with PathScale (and some as yet unannounced players) have decided to make the HMPP directives an open standard. The idea here is to attract application developers and tool makers to a standardized GPU programming model which protects their investment but is still targeted at gaining best performance.

http://www.caps-entreprise.com/fr/page/index.php?id=49&p_p=36

Can someone help explain this? High-level "abstraction" for hybrid computing, merging many-core CPUs with OpenCL and CUDA? Judging by the diagram, this must be a way to lower the amount of code needed.
 

ViRGE

Elite Member, Moderator Emeritus
Oct 9, 1999
What makes this interesting is that this isn't just a new runtime built on top of CUDA, it's a new runtime built right off of the ISA. These guys are basically poking and prodding at G80/GT200/GF100 half-blind trying to figure out how it works, and then compiling code straight to the bottom. It's ballsy for sure, as NVIDIA won't help them and it's very difficult to do any of this without the original documentation.

What I'm wondering is whether it will really take off. The HPC landscape is littered with the corpses of companies selling HPC development platforms and runtimes. Few of them are successful. NVIDIA is leveraging Visual Studio for CUDA on Fermi, and that's going to be very hard to beat even if ENZO is faster on average.
 

Voo

Golden Member
Feb 27, 2009
ViRGE said:
What I'm wondering is whether it will really take off. The HPC landscape is littered with the corpses of companies selling HPC development platforms and runtimes. Few of them are successful. NVIDIA is leveraging Visual Studio for CUDA on Fermi, and that's going to be very hard to beat even if ENZO is faster on average.
Well, at least the compiler sounds like a good idea and could be very much worth it. We don't even have to talk about register allocation; the current CUDA compiler doesn't even do a good job (if it does it at all) at simple things like loop unrolling. That's something you can fix as the developer yourself, but it gets messy, means more code, and all in all is something a compiler can do without problems.
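
For example, here's a toy sketch of what I mean (my own illustration only, nothing to do with PathScale's or Nvidia's actual compiler output): a trivial per-thread loop with a known trip count that the developer ends up unrolling by hand.

Code:
// Toy CUDA kernel, purely illustrative -- not PathScale or Nvidia code.
// Each thread scales four consecutive elements; the loop is unrolled by hand.
__global__ void scale4(float *out, const float *in, float alpha, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;

    // What you'd like to write and have the compiler unroll for you:
    //   for (int k = 0; k < 4; ++k)
    //       out[i + k] = alpha * in[i + k];

    // What you end up writing by hand -- same logic, more code to maintain:
    if (i + 3 < n) {
        out[i]     = alpha * in[i];
        out[i + 1] = alpha * in[i + 1];
        out[i + 2] = alpha * in[i + 2];
        out[i + 3] = alpha * in[i + 3];
    } else {
        for (int k = i; k < n; ++k)        // tail elements
            out[k] = alpha * in[k];
    }
}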

So there are obviously performance boosts possible, though as far as I know Nvidia doesn't disclose the internals of their architecture, so they have to look at the assembled code to find out how it works? But the more important question is: how long will it take Nvidia to incorporate those changes themselves? It's not as if they're doing nothing.
 

pathscale

Junior Member
Jun 29, 2010
Our goal is very simple: take existing code which in general would be a good fit to be offloaded to the GPU, drop in some pragmas, and turn those regions into native GPU code with call sites surrounding them. It's that simple to get started. Of course there will be things we'll be doing to increase unmodified code performance, but until then many of the Nvidia best practices will still apply.
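
Roughly, the directive style looks like this; treat it as an illustrative sketch of the idea only, and the pragma spellings below as approximate until the quick reference is published.

Code:
/* Illustrative sketch of directive-based offload only; the exact HMPP
 * clause names below are approximate, not a reference. */
#include <stdlib.h>

/* Mark an existing, unmodified C function as a GPU codelet. */
#pragma hmpp scale codelet, target=CUDA, args[vout].io=out
static void scale(int n, float vin[n], float vout[n], float alpha)
{
    for (int i = 0; i < n; i++)
        vout[i] = alpha * vin[i];
}

int main(void)
{
    int n = 1 << 20;
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    for (int i = 0; i < n; i++)
        a[i] = (float)i;

    /* The call site: here the runtime handles data movement and launches the
     * generated GPU code; strip the pragmas and it is still plain C. */
    #pragma hmpp scale callsite
    scale(n, a, b, 2.0f);

    free(a);
    free(b);
    return 0;
}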

Since we're going direct to the ISA, we need to limit our focus mostly to the Tesla series of cards and what's being deployed in clusters.

We'll publish our HMPP quick reference and more details on the directives soon, but we hope this information helps.
 

DrMrLordX

Lifer
Apr 27, 2000
Um . . .hi pathscale people!

Do you plan on adapting your approach to GPGPU computing to any of AMD's products? If their Llano APU sells well, there's going to be an enormous number of mobile and desktop users out there with stream processors under the hood that may or may not be using them to their fullest potential, especially if they don't game. AMD seems fairly intent on helping developers tap into those stream processors with OpenCL, but if your product lets people recompile existing code into an app that can be accelerated by a GPU, then it may be very useful indeed.
 

pathscale

Junior Member
Jun 29, 2010
DrMrLordX said:
Um . . .hi pathscale people!

Do you plan on adapting your approach to GPGPU computing to any of AMD's products? If their Llano APU sells well, there's going to be an enormous number of mobile and desktop users out there with stream processors under the hood that may or may not be using them to their fullest potential, especially if they don't game. AMD seems fairly intent on helping developers tap into those stream processors with OpenCL, but if your product lets people recompile existing code into an app that can be accelerated by a GPU, then it may be very useful indeed.

I think there are four markets here to look at:
1) HPC (which we can group to also mean financial, energy, and anyone who really needs to crunch numbers)

2) Gaming (making high-performance drivers and shader compilers which deliver the absolute best performance)

3) Console developers (people writing the games in Cg and other languages)

4) General applications (Firefox, OpenOffice, and friends)



Right now we're *very* focused on the NVIDIA Tesla series cards, as that's where the vast majority of market demand is. To add support for AMD/ATI, they would have to disclose a small amount of detail on how the compute programs are launched, which isn't available in their current documentation. While we could reuse a very large portion of our stack, there would still be a huge investment of time needed to really drive the performance. I'd humbly like to think we're in a great position to do this, but the project would need external financial support to really do it well. Taking into account the above markets, I really see them as a progression we have to tackle one at a time:

1) HPC is obviously our sweet spot and where most of the interest is today.
2) Add shader compiler support to make really high-performance drivers.
3) Add a front-end which console developers can use to easily squeeze every last bit of performance out of the hardware.
4) Lastly, but certainly not least important: general applications. There's a *huge* amount of work to be done before everyday applications can easily take advantage of the GPU.

I'm happy to answer any questions and get honest feedback.
 

cbn

Lifer
Mar 27, 2009
http://www.khronos.org/news/archives/

I thought the second post (dated July 22nd) in the Khronos news section was interesting.

If you liked Assembler, you'll love Open CL

In today's world of copy-paste programming expertise, OpenCL will force you to learn the nuts and bolts of programming.

From the link itself: http://www.streamcomputing.nl/blog/2010-07-22/the-rise-of-the-gpgpu-compilers

In the early days of BASIC, you could add Assembly-code to speed up calculations; you only needed to understand registers, cache and other details of the CPU. The people who did that and learnt about hardware can actually be considered better programmers than the C++ programmer who completely relies on the compiler’s intelligence.

Okay, let’s be honest: OpenCL is not easy fun. It is more a kind of readable Assembly than click-and-play programming. But, oh boy, you learn a lot from it! You learn architectures, capabilities of GPUs, special purpose processors and much more.

I want to stress that understanding hardware architectures stays important for GPGPU and any other programming language with performance in mind. In 2010 and 2011 you’ll still see OpenCL in the light, and before you know it, it’s all hidden in libraries and handled by the compiler; so learn! The only difference is that the programmer still must understand what hardware the software is to be run on.

I am not a programmer, but to my layperson's ears it sounds like OpenCL is much more difficult to use.
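
For example, this is roughly what a trivial OpenCL kernel looks like (adapted from the kind of vector-add examples you see in tutorials, so take it as illustrative only), and the host program apparently still needs dozens more lines of setup before it can run:

Code:
/* Minimal OpenCL C kernel (device-side code only). The programmer spells out
 * the memory qualifiers and the indexing; the host side still needs platform,
 * context, queue, buffer and kernel-compilation boilerplate before this runs. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    size_t i = get_global_id(0);   /* which element this work-item handles */
    if (i < n)
        c[i] = a[i] + b[i];
}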

I guess that is the whole point of a company like PathScale, right? Take a small group of people who are very knowledgeable at the hardware level and distill that into an easier-to-program format for larger groups of programmers.