[techradar] This startup wants to kill the CPU and GPU in one go

Hitman928 · Jun 26, 2018

https://www.techradar.com/news/this-startup-wants-to-kill-the-cpu-and-gpu-in-one-go

Apologies if this has been posted before.

Some pretty bold claims.

Beyond that new label though are a number of hard-to-believe claims that the company has put forward including the fact that it delivers a 10x performance improvement on conventional processors.

Tachyum’s 64 core Prodigy processor, will generate ~128TFLOPS.

Chances it ever comes to fruition?

DaveSimmons · Jun 26, 2018

Who knows? So far it's just bold claims with no independent data about the design's performance on real-world workloads.

Even if they have some new twists that put them ahead of other designs for the current generation, everyone else might learn from their work and make them a footnote.

We've seen similar claims about new GPU designs in the past, but none have managed to challenge nvidia and ATI-AMD.

LTC8K6 · Jun 26, 2018

Sounds a bit like Larrabee?

Mopetar · Jun 26, 2018

As with everything else that sounds too good to be true, I doubt this is true either. If you could design something like that which would get this kind of performance, then you'd build it and take the market by storm. Computer chip design is sufficiently complex that most people who might otherwise be able to spot this type of rubbish get drawn in all the same. Trumpeting all this to the press is just a way of getting investors or to try to get some other company to acquire them.

William Gaatjes · Jun 26, 2018

If the multi level flash controllers mean anything, they do seem to come up with practical ideas that solve problems everybody else has but looks at differently.
It might very well happen that some of the prodigy tech ends up in all future upcoming processors within a few years as patents.

DaveSimmons · Jun 26, 2018

William Gaatjes said:
If the multi level flash controllers mean anything, they do seem to come up with practical ideas that solve problems everybody else has but looks at differently.
It might very well happen that some of the prodigy tech ends up in all future upcoming processors within a few years as patents.

Hard to say without knowing what ideas would be used and whether or not they were patentable by this company. For example most of the basic ideas behind cryptocoins and blockchain came from university researchers and no one can get patents on them. So the ideas might either be unpatentable as prior art, or owned by some university.

William Gaatjes · Jun 26, 2018

DaveSimmons said:
Hard to say without knowing what ideas would be used and whether or not they were patentable by this company. For example most of the basic ideas behind cryptocoins and blockchain came from university researchers and no one can get patents on them. So the ideas might either be unpatentable as prior art, or owned by some university.

I wonder what they mean when they talk about wires. Maybe they have circumvented some physics issue. I mean, they state they are a semiconductor company.

The Prodigy architecture is the result of decades of experience that I developed designing processors (e.g. Playstation 2, Tesla), flash memory controllers (Sandforce), and flash based systems (Skyera). Several years of self-funded R&D preceded Tachyum’s emergence from stealth mode. I have always been interested in solving “device physics” challenges, such as reliability issues in dual level cell flash memory, as I did at Sandforce. Prodigy is another example of that. With the decade long stagnation of processor clock speed, due in large part to slow wires relative to transistor switching speed, and coupled with CPU architectures which were designed when wires were infinitely fast compared to transistors, a fresh look at an optimal 21st century processor architecture was warranted. We started from a clean sheet of paper with a design philosophy of reducing the number of slow wires on a chip, and reducing the average length of existing wires. The result is breakthrough performance and low power consumption.

It seems they tackled the problem from multiple angles. The layout design of the chip was seen as the fundamental problem, so they designed an architecture that solves that issue by stripping out functionality tradionally done in hardware and letting the compiler solve it. That is how i read it.
But the first thing to my mind comes with how much bandwidth is there, what use is a fast core with high IPC if you cannot get the beast fed.
I think that is an issue Intel before and now AMD again as well with Zen have been tackling for years.
If you make the core too wide, layout gets more complex and the connections in the chip get longer. That makes me wonder if that is what they talk about.
Make lot of narrow cores that can run like crazy. Use a 3 operand instructionset with fixed width for efficiency and instructions for controlling the threads.
Solve the thread allocation on the compiler. Seems what a lot of people do who make... GPUs.
Like for example Nvida and AMD/RTG.

edit: the text above is my speculation.

PeterScott · Jun 26, 2018

I read through the article, and press releases. This is NOT a GPU replacement for graphics.

It talks about bettering GPUs on AI workloads.

It's conceivable that a startup could make a new simple massively parallel chip that will do very well at data center and AI workloads.

Claims will obviously need to be tested and I am sure we will see a lot of "up to" type claims where they excel at very specific cases.

Definitely something to keep an eye on.

Thala · Jun 26, 2018

William Gaatjes said:
I wonder what they mean when they talk about wires.

They literally mean the wires aka the metal layers and vias. And they are referring to the fact that with a shrinking process technology, the switch speed goes up but signal propagation speed goes down.

Thunder 57 · Jun 26, 2018

Notice all of the I's in there. I did this, I did that, I learned from others mistakes (no mention of learning from his own). Then there was just a bunch of fluff with magical compilers replacing the need for OoO execution. We've been down this road before, we've seen Itanium. I think it's fair to say that I am more than a bit skeptical.

wahdangun · Jun 26, 2018

William Gaatjes said:
I wonder what they mean when they talk about wires. Maybe they have circumvented some physics issue. I mean, they state they are a semiconductor company.

It seems they tackled the problem from multiple angles. The layout design of the chip was seen as the fundamental problem, so they designed an architecture that solves that issue by stripping out functionality tradionally done in hardware and letting the compiler solve it. That is how i read it.
But the first thing to my mind comes with how much bandwidth is there, what use is a fast core with high IPC if you cannot get the beast fed.
I think that is an issue Intel before and now AMD again as well with Zen have been tackling for years.
If you make the core too wide, layout gets more complex and the connections in the chip get longer. That makes me wonder if that is what they talk about.
Make lot of narrow cores that can run like crazy. Use a 3 operand instructionset with fixed width for efficiency and instructions for controlling the threads.
Solve the thread allocation on the compiler. Seems what a lot of people do who make... GPUs.
Like for example Nvida and AMD/RTG.

edit: the text above is my speculation.

It's sound like Itanium all over again, over reliance on compiler is not good.

William Gaatjes · Jun 27, 2018

PeterScott said:
I read through the article, and press releases. This is NOT a GPU replacement for graphics.

It talks about bettering GPUs on AI workloads.

It's conceivable that a startup could make a new simple massively parallel chip that will do very well at data center and AI workloads.

Claims will obviously need to be tested and I am sure we will see a lot of "up to" type claims where they excel at very specific cases.

Definitely something to keep an eye on.

I am not seeing it as a GPU.
But it is very GPU product alike in the sense that it is highly parallel in nature and has a compiler and probably even a profiler to optimize where all the threads are executed and that they are executed as efficient as possible. That is the similarity i am writing about.
Itanium is mentioned a lot.
It makes me wonder if at the time Intel had a profiler software suite as is common now for Nvidia and AMD/RTG have for their gpu systems.

Thala said:
They literally mean the wires aka the metal layers and vias. And they are referring to the fact that with a shrinking process technology, the switch speed goes up but signal propagation speed goes down.

Thank you.

I was thinking of that but was not sure.

coercitiv · Jun 27, 2018

So in order to enforce the credibility of the Prodigy "software enhanced" computing platform, the CEO reminds us of his experience with the "software enhanced" SandForce SSD controllers. I have a bad feeling about this.

scannall · Jun 27, 2018

Is new tech, and new ideas cool? Yep, you bet. But I am dubious about these claims. Put them to the test. I would be happy to be wrong. But I don't think I am.

EXCellR8 · Jun 27, 2018

we'll see...

PeterScott · Jun 27, 2018

William Gaatjes said:
I am not seeing it as a GPU.
But it is very GPU product alike in the sense that it is highly parallel in nature and has a compiler and probably even a profiler to optimize where all the threads are executed and that they are executed as efficient as possible. That is the similarity i am writing about.
Itanium is mentioned a lot.
It makes me wonder if at the time Intel had a profiler software suite as is common now for Nvidia and AMD/RTG have for their gpu systems.

I agree. This is very GPU like, with high numbers, of limited simple execution units. The original title makes it sound like this is a GPU replacement which it isn't for graphics, this is just for compute workloads.

The Itanium connection really isn't as strong as the GPU connection. Itanium really wasn't about a bunch of simple compute units. It was about an oddball Very Long Instruction Word, with high instruction parallelism, that had the compiler sort the mess out. It was aiming to go in the opposite direction of RISC.

This looks a lot more like a GPU with an ultra small instruction function set, just with huge number of identical units in parallel. Nothing like Itanium.

I don't think the compiler issue will be that big of deal, it will be more like programming for OpenCl/CUDA.

The real issue is if the claims really stand up to scrutiny.

William Gaatjes · Jun 27, 2018

PeterScott said:
I agree. This is very GPU like, with high numbers, of limited simple execution units. The original title makes it sound like this is a GPU replacement which it isn't for graphics, this is just for compute workloads.

The Itanium connection really isn't as strong as the GPU connection. Itanium really wasn't about a bunch of simple compute units. It was about an oddball Very Long Instruction Word, with high instruction parallelism, that had the compiler sort the mess out. It was aiming to go in the opposite direction of RISC.

This looks a lot more like a GPU with an ultra small instruction function set, just with huge number of identical units in parallel. Nothing like Itanium.

I don't think the compiler issue will be that big of deal, it will be more like programming for OpenCl/CUDA.

The real issue is if the claims really stand up to scrutiny.

I agree.

dogen1 · Jun 27, 2018

wahdangun said:
It's sound like Itanium all over again, over reliance on compiler is not good.

Itanium's main problem was non deterministic memory latency, at least as far as i understand it.

William Gaatjes · Jun 27, 2018

Thala said:
They literally mean the wires aka the metal layers and vias. And they are referring to the fact that with a shrinking process technology, the switch speed goes up but signal propagation speed goes down.

Now that i am home i have finally time to think about it and read up about it , i think i understand all the factors now.

When the process gets smaller, the resistance of the "wires" increases but since high speed signals also propagate through it, skin effects also becomes a role. As does capacitive effects.
Mainly relative permittivity (dielectric constant) and to some extend permability are the issues. At least that is what i understand of it.
Long time ago we had at work some experiments with a time domain reflection meter and the principle behind it. That was all about the velocity factor. And that was dependent on the material surrounding the conductor. Also was all about matching impedance to prevent reflections.
Even at chip layouts, the impedance must be matched yes to prevent reflections ?

For reflections to occur, the wire must be way longer so that the propagation time of the signal is much longer than the rise time of the signal and the impedance is not matched.

An advantage of smaller process technology would be that the propagation time of the signal is shorter but at the same time the relative permittivity increases. And thus the propagation speed goes down as well. There goes the advantage out of the window again.
Because the relative permittivity increases, the coulombforce on the electrons increase. That would make for less easy passing on of the " charge" signal from electron to electron.

And there is something that is confusing me.
I always understood that electron do not really travel that fast through a material because of mainly scattering and other atomic forces.
Thus when a signal is applied with very short rise time, the passing of the charge (the signal wave front) goes with the velocity factor or signal propagation speed but the electrons move at a relative slow pace through the material. But the electrons are very good at passing on information.
I recently read about ballistic conduction where an electron can travel through a material without scattering. Without encountering resistance but also not being a super conductor.
It is just that at small sizes, the electron encounters less interaction from surrounding atoms. At least i think that is the case of everything influences everything when interacting.

Does my rambling make any sense at all ?

ClockHound · Jun 27, 2018

William Gaatjes said:
Does my rambling make any sense at all ?

Yes, it does. And it's worrying. In the instant gratification era who has time to wait for slow electrons?

aigomorla · Jun 28, 2018

Sigh, it doesnt matter how fast a cpu and gpu is if there is no simple development kit for programmers to write code for it.

This is why the x86 arch is still used, and slowly ARM taking over.

Its like saying what is the point in a car which can go really really fast, if it can only go straight?

coercitiv · Jun 28, 2018

A bit of a reminder from the memory lane:

The Secret of Denver: Binary Translation & Code Optimization

NVIDIA’s decision to forgo a traditional out-of-order design for Denver means that much of Denver’s potential is contained in its software rather than its hardware. The underlying chip itself, though by no means simple, is at its core a very large in-order processor. So it falls to the software stack to make Denver sing.

Accomplishing this task is NVIDIA’s dynamic code optimizer (DCO). The purpose of the DCO is to accomplish two tasks: to translate ARM code to Denver’s native format, and to optimize this code to make it run better on Denver. With no out-of-order hardware on Denver, it is the DCO’s task to find instruction level parallelism within a thread to fill Denver’s many execution units, and to reorder instructions around potential stalls, something that is no simple task.

Running code translation and optimization is itself a software task, and as a result this task requires a certain amount of real time, CPU time, and power. This means that it only makes sense to send code out for translation and optimization if it’s recurring, even if taking the ARM decoder path fails to exploit much in the way of Denver’s capabilities.

This sets up some very clear best and worst case scenarios for Denver. In the best case scenario Denver is entirely running code that has already been through the DCO, meaning it’s being fed the best code possible and isn’t having to run suboptimal code from the ARM decoder or spending resources invoking the optimizer. On the other hand then, the worst case scenario for Denver is whenever code doesn’t recur. Non-recurring code means that the optimizer is never getting used because that code is never seen again, and invoking the DCO would be pointless as the benefits of optimizing the code are outweighed by the costs of that optimization.

The good news is in the case of Prodigy workloads would likely be easier to profile and optimization will be extracted for each case. The bad news is bigger fish have already tried replacing complex hardware with even more complex software, and failed. IIRC the case of the Sandforce controller the performance numbers were there at least, in the case of Denver, not so much.

ksec · Jun 28, 2018

Note these numbers down and people should figure it out themselves.

How much is the Cost of CPU? Even in 10s of Thousands? And how much time you need to install those.

How much cost of programmers, especially in their special domains, their salary in hundreds, every month. And how much time you need to rewrite your application.

I am convinced, especially on the server, any future innovation will have to be built on top of x86. Until x86 stagnant, before anything else could topple it.

NostaSeronx · Jun 28, 2018

coercitiv said:
In the case of Denver, not so much.

Denver wasn't wide. It was a puny, tiny, weak core made for wimps by wimps. CISC instructions are more efficient in re-encoded ISA architectures.

If you want to see a working implementation of:
1. Need more ALUs, at least more than 8.
2. Need more FPUs, at least more than 8.
3. Need Vector Address-Generation, capable of feeding the beast efficiently.

Less is not more when re-encoding!

To achieve its lofty performance goals, Tachyum has designed a new 64-bit architecture that combines elements of RISC, CISC, and VLIW. The company says it will not only beat today’s Xeons but will also compete strongly with GPUs on machine learning. In sum, Tachyum is making many promises that require extraordinary effort to fulfill.
- http://www.linleygroup.com/newsletters/newsletter_detail.php?num=5870&year=2018&tag=3

Intel will do it first, probs.
https://images.anandtech.com/doci/10025/Presentation (10).jpg
https://www.anandtech.com/show/10025/examining-soft-machines-architecture-visc-ipc/5

Clustered Simultaneous Multithreading with virtual core/thread allocation by Intel will curb stomp Tachyum.

Nothingness · Jun 28, 2018

NostaSeronx said:
Denver wasn't wide. It was a puny, tiny, weak core made for wimps by wimps. CISC instructions are more efficient in re-encoded ISA architectures.

Do you consider 7-way as not wide?

If you want to see a working implementation of:
1. Need more ALUs, at least more than 8.
2. Need more FPUs, at least more than 8.
3. Need Vector Address-Generation, capable of feeding the beast efficiently.

Less is not more when re-encoding!

You forgot to count branch units. To sustain 8 ALU you need to at least resolve/predict 2 branches per cycle.

[techradar] This startup wants to kill the CPU and GPU in one go

Diamond Member

Elite Member

Lifer

Diamond Member

Lifer

Elite Member

Lifer

Platinum Member

Golden Member

Platinum Member

Golden Member

Lifer

Diamond Member

Golden Member

Diamond Member

Platinum Member

Lifer

Senior member

Lifer

Golden Member

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member

Diamond Member

Senior member

Diamond Member

Platinum Member