What do you think about making a radically parallel architecture?

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
http://www.youtube.com/watch?v=8pTEmbeENF4#t=24m7s

See the video above. One of the most brilliant programmers on any forum I've been to recommended it to me. I haven't watched it yet.

I'm asking because I want the successor to rasterization to be done in software, on a non-fixed-function architecture that is as good as fixed-function hardware units.

Currently, fixed-function hardware may be faster, but that's the only inherent benefit to it... you can't do anything you want with it.
 

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
But it is the fact that it IS single-purpose that makes reasonably priced, massively parallel computation (such as in a GPU) economical.

Multiplying the execution units by 4000 would make a multipurpose CPU cost-prohibitive. Sure, you can eliminate half the complexity by ditching some of the out-of-order and speculative execution components, but that doesn't change the overall complexity much.
 

videogames101

Diamond Member
Aug 24, 2005
6,783
27
91
So, you can't make general-purpose hardware as fast as fixed-function hardware. That's an inherent trade-off.

As an example, look at Bitcoin mining. CPU mining was slow. Sure, doing it on a general-purpose GPU was faster. However, when you design an ASIC to do the same job, it is orders of magnitude faster.

What you're talking about would require some kind of breakthrough in available computation power for general purpose processors.
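For anyone who hasn't looked at it, the mining workload really is just two SHA-256 passes over an 80-byte header, repeated for each nonce with no shared state, which is exactly the shape of problem an ASIC eats alive. A rough sketch of that inner loop, assuming OpenSSL's SHA256() and a simplified "count leading zero bytes" stand-in for the real difficulty test (compile with -lcrypto):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

/* Count leading zero bytes of a hash; a crude stand-in for the real
 * difficulty/target comparison. */
static int leading_zero_bytes(const unsigned char *h)
{
    int n = 0;
    while (n < SHA256_DIGEST_LENGTH && h[n] == 0)
        n++;
    return n;
}

int main(void)
{
    unsigned char header[80] = {0};   /* version, prev hash, merkle root, time, bits, nonce */
    unsigned char h1[SHA256_DIGEST_LENGTH], h2[SHA256_DIGEST_LENGTH];

    for (uint32_t nonce = 0; nonce < 1000000; nonce++) {
        memcpy(header + 76, &nonce, 4);        /* nonce occupies the last 4 bytes */
        SHA256(header, sizeof header, h1);     /* first SHA-256 pass */
        SHA256(h1, sizeof h1, h2);             /* second pass: the "double SHA" */
        if (leading_zero_bytes(h2) >= 2) {     /* simplified difficulty test */
            printf("candidate found at nonce %u\n", (unsigned)nonce);
            break;
        }
    }
    return 0;
}

Every iteration is independent, so an ASIC can replicate thousands of tiny fixed-function hash pipelines, while a general-purpose core spends most of its transistors on machinery this loop never uses.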
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,696
4,658
75
I have, for years, expected chip designers to stick an FPGA on a CPU, along with an API to write input and get output. It seems to me that programming in Verilog has to be easier than massively parallel programming with something like MPI. A really smart design would break the FPGA up into segments, each serving one thread. Maybe now that chips aren't getting faster as quickly, something like this will happen soon.
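For what it's worth, here is the kind of programming model I'm imagining, as a purely hypothetical sketch: none of these functions or types exist in any real driver, and the "FPGA" below is faked in software just so the example compiles and runs.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Invented handle for one FPGA segment (the "one segment per thread" idea). */
typedef struct {
    unsigned *buf;     /* data most recently streamed into the "fabric" */
    size_t    count;
} fpga_ctx;

static fpga_ctx *fpga_open(int segment) { (void)segment; return calloc(1, sizeof(fpga_ctx)); }
static int fpga_load_bitstream(fpga_ctx *c, const char *image) { (void)c; (void)image; return 0; }

static int fpga_write(fpga_ctx *c, const unsigned *in, size_t n)
{
    c->buf = malloc(n * sizeof *in);
    if (!c->buf) return -1;
    memcpy(c->buf, in, n * sizeof *in);
    c->count = n;
    return 0;
}

/* Pretend the loaded bitstream implements a "multiply by 2" filter; a real
 * part would do this in programmable logic while the CPU does other work. */
static int fpga_read(fpga_ctx *c, unsigned *out, size_t n)
{
    for (size_t i = 0; i < n && i < c->count; i++)
        out[i] = c->buf[i] * 2;
    return 0;
}

static void fpga_close(fpga_ctx *c) { free(c->buf); free(c); }

int main(void)
{
    unsigned in[4] = {1, 2, 3, 4}, out[4] = {0};
    fpga_ctx *c = fpga_open(0);                 /* claim segment 0 for this thread */
    fpga_load_bitstream(c, "double.bit");       /* hypothetical bitstream name */
    fpga_write(c, in, 4);                       /* stream input in */
    fpga_read(c, out, 4);                       /* stream results out */
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]);
    fpga_close(c);
    return 0;
}

In a real part, fpga_read would return whatever the loaded bitstream computes while the CPU thread does other work; the point is only that the API could be as simple as open, load, write, read.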
 

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
From 2010:

demand for an ASIC substitute remains strong, so companies keep striving to meet the challenge. The latest attempts are from Intel, STMicroelectronics, and Xilinx—and Altera is promising future announcements. All these companies are crafting new ways to combine hard CPU cores with programmable logic.

Intel’s solution is the Intel® Atom™ processor E600C Series (codename Stellarton), which pairs an Intel Atom processor SoC with an Altera FPGA in a multichip package. These products even have integrated graphics. But the key feature is programmable logic, which allows developers to add application-specific functions that normally would require spinning a custom ASIC.

At 1.3GHz, the Intel® Atom™ processor E665C is the fastest member of the series. In one package is a single-core dual-threaded Intel Atom processor SoC (codename Tunnel Creek) and an Altera Arria-II FPGA. The two die are bonded side-by-side and linked over a 1x1 PCI Express Gen1 interface. Notably, the Intel Atom processor E600C series is the only hybrid processor of this type supporting the Intel® architecture (x86).

The FPGA die is from Altera’s midrange family. It has 63,250 reprogrammable logic elements, the equivalent of about 759,000 ASIC gates—roomy enough for customization. It also has 5.3Mbits of block RAM and 312 18x18-bit multipliers (often called DSP blocks by FPGA vendors). Because the multipliers are hard wired, they free all the programmable logic for custom development. I/O interfaces include eight 3.125Gbps serdes lanes and 364 I/O pins capable of various signaling standards.

One limitation is that all communications between the FPGA and the Intel Atom processor must funnel through the 1x1 PCIe channel linking the two die. This bottleneck makes offloading some tasks from the CPU to the FPGA impractical. For many embedded applications, however, this channel is sufficient.

They aren't really a general purpose acceleration scheme, but rather a means to solve specific problems that are difficult to do in software.

The problem is that FPGAs are slow and expensive compared to ASICs, except for special-purpose applications. Without careful design, FPGAs are often slower than comparable code written for a good, fast general-purpose CPU (and harder to develop for).

It's super cool to implement some control circuitry in an FPGA without having to bother with realtime interrupts, driver management, etc. in a PC (or building an ASIC), but for things that your Core i7 can do reasonably efficiently, an FPGA probably isn't the optimal choice anyway.

It's not that it's impossible, nor that it's actually that bad an idea, but it is a complex one with specific use cases where it really pays off.

Disclaimer: My FPGA experience is 12 years out of date and I understand things have changed slightly, though my impression is that this is still the case for the most part.
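To put the article's x1 PCIe Gen1 link in perspective: 2.5 GT/s with 8b/10b encoding works out to roughly 250 MB/s per direction, so any offload has to win back the time spent just moving data. A quick back-of-the-envelope calculation (ballpark numbers, ignoring protocol overhead and latency):

#include <stdio.h>

int main(void)
{
    const double link_mb_s = 250.0;            /* x1 PCIe Gen1, per direction */
    const double sizes_mb[] = {1, 64, 1024};   /* example payload sizes */

    for (int i = 0; i < 3; i++) {
        double round_trip_ms = 2.0 * sizes_mb[i] / link_mb_s * 1000.0;
        printf("%6.0f MB in and out -> ~%8.1f ms on the link alone\n",
               sizes_mb[i], round_trip_ms);
    }
    /* Unless the FPGA saves more compute time than this, offloading loses. */
    return 0;
}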
 
Last edited:

exdeath

Lifer
Jan 29, 2004
13,679
10
81
I think compute performance is irrelevant. We need to solve the I/O bottleneck first.

Every time we get faster processors and double the data size, disk and network performance falls further and further behind.

We should be racing for universal non-volatile 1:1 main memory before worrying about yet more processor speed that sits idle waiting on I/O interrupts.
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,067
3,574
126
I think compute performance is irrelevant. We need to solve the I/O bottleneck first.

Every time we get faster processors and double the data size, disk and network performance falls further and further behind.

We should be racing for universal non-volatile 1:1 main memory before worrying about yet more processor speed that sits idle waiting on I/O interrupts.

I totally agree with you.

Let's ditch RAM... and just use an SSD-type medium for both RAM and storage.

However, we can't, because RAM is a lot faster than SSDs.
 

A5

Diamond Member
Jun 9, 2000
4,902
5
81
I have, for years, expected chip designers to stick an FPGA on a CPU, along with an API to write input and get output. It seems to me that programming in Verilog has to be easier than massively parallel programming with something like MPI. A really smart design would break the FPGA up into segments, each serving one thread. Maybe now that chips aren't getting faster as quickly, something like this will happen soon.

There are plenty of embedded systems that combine a fixed-function CPU with a programmable logic section. This used to be the realm of PowerPC and MIPS, but the latest designs have ARM CPUs on them.

They're hideously expensive for non-specialist applications, though. One board I worked on had the Xilinx Virtex 5 FX100T, which gives you 2 slow PowerPC cores and a big chunk of FPGA logic for the "low" price of $2200/chip.
 

Ben90

Platinum Member
Jun 14, 2009
2,866
3
0
I think compute performance is irrelevant. We need to solve the I/O bottleneck first.

Every time we get faster processors and double the data size, disk and network performance falls further and further behind.

We should be racing for universal non-volatile 1:1 main memory before worrying about yet more processor speed that sits idle waiting on I/O interrupts.
As needs to be mentioned every time you bring this up, storage and memory bandwidth is largely a solved problem with current SSDs.

Going from 1 GB/s to 50 GB/s main storage speed barely shows an improvement in all but the most demanding scenarios. Yes, copying a Blu-ray rip might take an extra few seconds, but they make medicine for that if you are stressing out over it. Things like opening fat Excel files, which used to take minutes on a HDD, now clock sub-second times. From pressing the power button to having Chrome open takes 15 seconds on modern computers. The majority of what you are waiting for is unnecessary software interrupts.

On the RAM side, going from 50 GB/s to infinite only shows improvements for compression, and very slight improvements on the IGP. Latency from RAM to CPU is a much larger issue, and one that memristors don't actually solve. Cutting that in half would be more beneficial in most scenarios than infinite bandwidth.
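Rough arithmetic behind the latency point, using ballpark numbers rather than measurements: a dependent, pointer-chasing access pattern gets one 64-byte cache line per round trip, so peak bandwidth barely enters into it.

#include <stdio.h>

int main(void)
{
    const double line_bytes = 64.0;              /* one cache line per round trip */
    const double latency_ns[] = {70.0, 35.0};    /* ballpark today vs "cut in half" */

    for (int i = 0; i < 2; i++) {
        double gb_per_s = line_bytes / latency_ns[i];   /* bytes per ns == GB/s */
        printf("latency %4.0f ns -> effective %.1f GB/s for dependent loads\n",
               latency_ns[i], gb_per_s);
    }
    /* Roughly 0.9 GB/s vs 1.8 GB/s: halving latency helps far more than raising
     * a 50 GB/s peak that dependent code never gets near anyway. */
    return 0;
}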
 
Last edited:

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
It would be nice to see programmable blending and depth, though, with full-speed double precision. I'm just not convinced that fixed-function hardware is the right place for blending and depth.

It would've also been nice to see all z-buffer hardware formats removed except for D32FPS8X8 and D32FX, along with all the hardware for render targets below RGBA16 precision. And non-crippled double-precision compute performance would've been nice too.
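To make it concrete, this is all "programmable blending and depth" would mean in a software pipeline: the blend equation and the depth comparison become ordinary per-pixel code instead of fixed-function ROP hardware. A minimal sketch, with an illustrative framebuffer layout and blend function rather than any real API:

#include <stdio.h>

typedef struct { float r, g, b, a; } rgba;       /* RGBA16F-or-better equivalent */

/* Any blend you like, not just the fixed modes; here, premultiplied "over". */
static rgba blend_over(rgba src, rgba dst)
{
    rgba out;
    out.r = src.r + dst.r * (1.0f - src.a);
    out.g = src.g + dst.g * (1.0f - src.a);
    out.b = src.b + dst.b * (1.0f - src.a);
    out.a = src.a + dst.a * (1.0f - src.a);
    return out;
}

/* Reversed 32-bit float depth: larger value means nearer, no fixed-point formats. */
static void shade_pixel(rgba *color_buf, float *depth_buf, int idx,
                        rgba src, float src_depth)
{
    if (src_depth > depth_buf[idx]) {            /* the depth test is just code */
        depth_buf[idx] = src_depth;
        color_buf[idx] = blend_over(src, color_buf[idx]);
    }
}

int main(void)
{
    rgba  cbuf[1] = {{0.0f, 0.0f, 0.0f, 1.0f}};
    float zbuf[1] = {0.0f};                      /* reversed-Z clear value */
    rgba  frag    = {0.5f, 0.25f, 0.0f, 0.5f};   /* premultiplied source color */
    shade_pixel(cbuf, zbuf, 0, frag, 0.75f);
    printf("pixel: %.2f %.2f %.2f %.2f  depth: %.2f\n",
           cbuf[0].r, cbuf[0].g, cbuf[0].b, cbuf[0].a, zbuf[0]);
    return 0;
}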
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
But it is the fact that it IS single-purpose that makes reasonably priced, massively parallel computation (such as in a GPU) economical.

Multiplying the execution units by 4000 would make a multipurpose CPU cost-prohibitive. Sure, you can eliminate half the complexity by ditching some of the out-of-order and speculative execution components, but that doesn't change the overall complexity much.

I'm not at all convinced on this.....

Look at CPU architectures from the 70s and 80s, like the 6502 used in the Apple ][, Commodore 64, and others.

It had only 96 instructions. That is counting every possible addressing mode of an operation as a different instruction. That is counting "Increment X register" as a different instruction from "Increment Y register".

Really, if you took the addressing mode and register being used as a separate dimension from the "instruction" the count would probably be around 20-25 instructions. I'm pretty sure that's less than any GPU.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Pro apps are already starting to use software rendering pipelines, using CUDA or GLSL. The performance just isn't there to replace rasterization. Once Haswell starts becoming the common low-end, I would expect it to become even more common. But, as long as you need high performance, it's not going to happen. Luckily, your depth and blending issues were taken care of 10 years ago or more, and are now non-issues.

With more cores, and more programmable GPUs (namely, ones sharing the same virtual memory), we're already working toward such highly parallel computing. Working by results and constraints is also nothing new. The trick, which is consistently being worked on, is merging that with the paradigm of telling the computer what to do, because the two, when working in the same set of code, are square pegs and round holes.
 
Last edited:

NTMBK

Lifer
Nov 14, 2011
10,452
5,839
136
Xeon Phi. A bootable CPU with 72 cores, 288 threads, all in one socket.
 

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
I'm not at all convinced on this.....

Look at CPU architectures from the 70s and 80s, like the 6502 used in the Apple ][, Commodore 64, and others.

It had only 96 instructions. That is counting every possible addressing mode of an operation as a different instruction. That is counting "Increment X register" as a different instruction from "Increment Y register".

Really, if you took the addressing mode and register being used as a separate dimension from the "instruction" the count would probably be around 20-25 instructions. I'm pretty sure that's less than any GPU.

The complexity of offering REAL instructions operating at a high frequency is non-trivial.

You can cite an Apple ][, but the 6502 was a low-clock, non-pipelined design. Even modern renditions of the chip run at 2 MHz or less.

It's not capable of floating-point operations, so modern software would run abysmally slowly on it, with floating point emulated in software.

Looking at chips from the era that supported floating point in hardware, we find things like the 8087 FPU. Sure, it's only 45,000 transistors (simple), but it is also a non-pipelined design. As such, even on a modern process, it would probably run WELL under 1 GHz (closer to 100 MHz, I'd guess). At the same time, it can take up to 150 CPU cycles just to do a single FMUL.

This is WAY faster than software emulation, which would require tens of thousands of cycles to do the same thing, but it's still not what you're looking for when you talk about performance IMPROVEMENTS from massively parallel systems.

The first chip that could do important operations like FMUL in under 20 cycles was the 486DX series, but now we're talking about over a million transistors per core, and still only a short 5-stage pipeline, limiting best-case frequency scaling to a fraction of a modern CPU's.

Modern chips do FMUL in 2-3 cycles, and do it at multi-GHz speeds. That's a non-trivial accomplishment that can't be compared to a MOS 6502 from an Apple ][.

This isn't even addressing SIMD (SSE, MMX, etc.), memory addressing, speculative execution, out-of-order execution, branch prediction, cache coherence, translation lookaside buffers, cache associativity, and all the other performance enhancements that make modern computers not suck so much, plus the requirements for multi-core processing. Then there are security enhancements like the NX bit, virtualization enhancements like VT, and so on.

Giving them all up in favor of 100 CPUs could speed up certain applications, such as high-volume math operations, but would drastically slow down operations that are linear in nature, which almost half of execution will be, by the very nature of computing.

I think the "large, complex, speculative, out-of-order CPU" combined with a "many-core specialized ASIC" is the way to go, in much the way we currently use the CPU/GPU combination.

What I was saying in my original post is that, if you combine all the features we expect from a multipurpose CPU, you won't be able to scale it by a factor of 100.

If you want to ditch floating point, cache coherency, and a hundred other features, you could build a multi-purpose CPU small enough to put 100 of them on a die, but you would have a crappy computer.

A series of special-purpose engines (in massively parallel configurations), however, might solve this problem. Have 100 GPU-like cores that do a limited set of matrix operations, another few dozen cores that only do basic FMUL/FDIV, and another set that handles highly pipelined, speculative integer operations. The combination just might beat out the current CPU/GPU combination, but only if it were very carefully built.
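Conceptually, the scheduling side of that could look something like this; the engine types and the routing below are invented purely to show the shape of the model, not any real hardware interface:

#include <stdio.h>

typedef enum { WORK_MATRIX, WORK_SCALAR_FP, WORK_BRANCHY_INT } work_kind;

typedef struct {
    work_kind kind;
    double    a, b;       /* toy payload */
} work_item;

/* Stand-ins for the three pools of engines described above. */
static double run_on_matrix_engine(const work_item *w) { return w->a * w->b; }
static double run_on_fp_engine(const work_item *w)     { return w->a / w->b; }
static double run_on_int_engine(const work_item *w)    { return (w->a > w->b) ? w->a : w->b; }

/* The scheduler's only job: route each item to the pool built for it. */
static double dispatch(const work_item *w)
{
    switch (w->kind) {
    case WORK_MATRIX:    return run_on_matrix_engine(w);
    case WORK_SCALAR_FP: return run_on_fp_engine(w);
    default:             return run_on_int_engine(w);
    }
}

int main(void)
{
    work_item queue[] = {
        {WORK_MATRIX,     3.0, 4.0},
        {WORK_SCALAR_FP, 10.0, 4.0},
        {WORK_BRANCHY_INT, 7.0, 2.0},
    };
    for (int i = 0; i < 3; i++)
        printf("item %d -> %.2f\n", i, dispatch(&queue[i]));
    return 0;
}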
 
Last edited:

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
Xeon Phi. A bootable CPU with 72 cores, 288 threads, all in one socket.

This is, architecturally, just a Xeon chip with a high-end GPU strapped directly onto it, and some APIs for using the GPU in a standardized way.

But yeah, that's the way of the future for compute applications, for sure. The discrete GPU (especially as a dedicated graphics chip) probably has a limited future. It will come to be seen more as a "co-processor", in much the way Intel is advertising it.
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
The complexity of offering REAL instructions operating at a high frequency is non-trivial.

You can cite an Apple ][, but the 6502 was a low-clock, non-pipelined design. Even modern renditions of the chip run at 2 MHz or less.

It's not capable of floating-point operations, so modern software would run abysmally slowly on it, with floating point emulated in software.

Looking at chips from the era that supported floating point in hardware, we find things like the 8087 FPU. Sure, it's only 45,000 transistors (simple), but it is also a non-pipelined design. As such, even on a modern process, it would probably run WELL under 1 GHz (closer to 100 MHz, I'd guess). At the same time, it can take up to 150 CPU cycles just to do a single FMUL.

This is WAY faster than software emulation, which would require tens of thousands of cycles to do the same thing, but it's still not what you're looking for when you talk about performance IMPROVEMENTS from massively parallel systems.

The first chip that could do important operations like FMUL in under 20 cycles was the 486DX series, but now we're talking about over a million transistors per core, and still only a short 5-stage pipeline, limiting best-case frequency scaling to a fraction of a modern CPU's.

Modern chips do FMUL in 2-3 cycles, and do it at multi-GHz speeds. That's a non-trivial accomplishment that can't be compared to a MOS 6502 from an Apple ][.

This isn't even addressing SIMD (SSE, MMX, etc.), memory addressing, speculative execution, out-of-order execution, branch prediction, cache coherence, translation lookaside buffers, cache associativity, and all the other performance enhancements that make modern computers not suck so much, plus the requirements for multi-core processing. Then there are security enhancements like the NX bit, virtualization enhancements like VT, and so on.

Giving them all up in favor of 100 CPUs could speed up certain applications, such as high-volume math operations, but would drastically slow down operations that are linear in nature, which almost half of execution will be, by the very nature of computing.

I think the "large, complex, speculative, out-of-order CPU" combined with a "many-core specialized ASIC" is the way to go, in much the way we currently use the CPU/GPU combination.

What I was saying in my original post is that, if you combine all the features we expect from a multipurpose CPU, you won't be able to scale it by a factor of 100.

If you want to ditch floating point, cache coherency, and a hundred other features, you could build a multi-purpose CPU small enough to put 100 of them on a die, but you would have a crappy computer.

A series of special-purpose engines (in massively parallel configurations), however, might solve this problem. Have 100 GPU-like cores that do a limited set of matrix operations, another few dozen cores that only do basic FMUL/FDIV, and another set that handles highly pipelined, speculative integer operations. The combination just might beat out the current CPU/GPU combination, but only if it were very carefully built.

You are very wrong about the clock scaling. Even in the 80s "accelerators" for the Apple II had 14+MHz versions on expansion cards. And I've seen one OC to 50MHz in '92. But the instruction set is missing quite a lot (which was to some degree the point..)

For something that isn't missing fundamental, used-all-the-time instructions (so they don't all have to be heavy subroutines), what about a 68000+68881 widened to 64-bit? That gets you double the register count of current x86-64 chips. Sure, there are no vector operations, but overall I'd say the instruction set is more complete than the original Pentium's, other than lacking an MMU until the 68030 revision. And do you know how they came up with the model number? 68,000 is the actual transistor count. That means matching the transistor count of Haswell would take about 19,000 cores. Cutting that in half for the 32->64-bit transition and rounding, you get roughly 10,000 cores in the same die space as current 4-core chips. For apps that are that parallel, I don't think pipelining is such an issue.
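The back-of-the-envelope math behind those core counts, assuming roughly 1.3 billion transistors for a quad-core Haswell die (the exact figure depends on the SKU) and a crude 2x cost to widen each core to 64-bit:

#include <stdio.h>

int main(void)
{
    const double haswell_transistors = 1.3e9;   /* assumed quad-core budget */
    const double m68000_transistors  = 68000;   /* the 68000's namesake count */

    double cores_32bit = haswell_transistors / m68000_transistors;
    double cores_64bit = cores_32bit / 2.0;     /* crude 2x cost to widen to 64-bit */

    printf("32-bit 68000-class cores in a Haswell budget: ~%.0f\n", cores_32bit);
    printf("after doubling each core for 64-bit:          ~%.0f\n", cores_64bit);
    return 0;
}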
 
Last edited:

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
You are very wrong about the clock scaling. Even in the 80s "accelerators" for the Apple II had 14+MHz versions on expansion cards. And I've seen one OC to 50MHz in '92. But the instruction set is missing quite a lot (which was to some degree the point..)

For something that isn't missing fundamental, used-all-the-time instructions (so they don't all have to be heavy subroutines), what about a 68000+68881 widened to 64-bit? That gets you double the register count of current x86-64 chips. Sure, there are no vector operations, but overall I'd say the instruction set is more complete than the original Pentium's, other than lacking an MMU until the 68030 revision. And do you know how they came up with the model number? 68,000 is the actual transistor count. That means matching the transistor count of Haswell would take about 19,000 cores. Cutting that in half for the 32->64-bit transition and rounding, you get roughly 10,000 cores in the same die space as current 4-core chips. For apps that are that parallel, I don't think pipelining is such an issue.

First of all, more than 3/4 of the transistors in a modern chip are cache. Those would still be needed (arguably even more needed).

Second, the 68881 takes over 100 cycles per FLOP and has no pipelining, SEVERELY limiting the speed.

A modern-design, 68k-compatible (with FPU) 64-bit chip with aggressive pipelining, a decent cache, a modern TLB, cache coherency, etc. would be 1-5 million transistors (a guess). It would still be an order of magnitude slower (per core) than existing chips, and you could probably cram 25 (maybe 70-100) of them into a die the size of a modern chip.

But on linear processing (applications that cannot be parallelized due to the nature of the data), it would be on the order of 20-100 times slower than current configurations.

It would also be 10-100 times slower than current GPUs at the highly structured, threaded math operations that GPUs are currently used for.

There would be a seldom-used sweet spot in the middle, workloads that run complex floating-point operations in parallel, where it might be faster, but I'm not even sure about that.

As for the Apple II "accelerators", they were simply pin-compatible, but didn't use anything resembling the original architecture. The point is moot regardless, because old architectures don't have ANY of the performance enablers that you're taking for granted.

A billion Apple II cores may be "on paper" faster than a single Haswell, but in practice they will never be, because of the volume of code that is linear.

When a single FLOP takes 150 cycles (and must go out to RAM a dozen times), the practical use for a massively parallel version of this chip is negligible.

If you want to look at "modern" architectures, the 386 was the first desktop x86 chip to have protected-mode operation and 32-bit registers, and the 486 was the first to have an integrated floating-point unit and "true" pipelining as we understand it today, which is what enables the speed we take for granted. So in my opinion, something at least the size of a 486 (with the transistor count increased by 80% or so to account for 64-bit processing) is the smallest core you might even CONSIDER building a massively parallel architecture out of.

Going parallel "at all costs" is not a solution except for some very obscure niche matrix math problems. There needs to be a tradeoff toward linear speed, and clockspeed as well, because a huge fraction of computation is not possible to make massively parallel, by definition as it is dependent on previous operations.

I agree that a RISC design is better than the x86 CISC design, but that's why every x86 chip since the Pentium Pro actually IS a RISC chip internally, with a translator at the front end that converts x86 into the internal micro-op RISC architecture. Much of the obscure x86 instruction set is handled in microcode, as a combination of those simpler micro-ops. I don't see a benefit to "reducing the number of instructions" as you suggested.

Have you ever done any ASIC design, or hardware design?
 
Last edited:

Anarchist420

Diamond Member
Feb 13, 2010
8,645
0
76
www.facebook.com
Pro apps are already starting to use software rendering pipelines, using CUDA or GLSL. The performance just isn't there to replace rasterization. Once Haswell starts becoming the common low-end, I would expect it to become even more common. But, as long as you need high performance, it's not going to happen.
The performance could be there given the right architecture and the right programmers, so it's not a black-and-white issue. That is, specialized hardware is not always faster, and it is certainly less versatile. Look at first-party PS3 games vs. Xbox 360 games, or Saturn exclusives vs. PlayStation games.

Luckily, your depth and blending issues were taken care of 10 years ago or more, and are now non-issues.
I don't know about all that. :) They weren't even using full-precision depth buffers until DX10 (nor was order-independent transparency required by the DX9 spec, if I'm not mistaken), and 32-bit fixed-point z precision is necessary in many cases. Look at UT '99... with only 24 bits of z precision there will either be some depth range lost (via a hack to the engine) or an S-load of flickering in the distance. And I don't think AMD even supports 32-bit float reverse z-buffers in OpenGL, although that may have changed over the past 2 years; I don't know for sure.
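To put numbers on the 24-bit complaint, here is a quick calculation of the eye-space error per depth-buffer step for D24 fixed point versus a 32-bit float reversed z-buffer, using an assumed near=0.1 / far=10000 projection and an object at distance 5000 (the scene numbers are arbitrary; compile with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const float n = 0.1f, f = 10000.0f;      /* assumed near/far planes */
    const float z = 5000.0f;                 /* a distant object, in eye space */

    /* For the standard perspective mapping, |d(buffer)/d(z_eye)| = f*n / ((f-n)*z^2),
     * so the eye-space error per buffer step is step * (f-n) * z^2 / (f*n). */
    const float scale = (f - n) * z * z / (f * n);

    /* 24-bit fixed point: a uniform step of 2^-24 across the whole range. */
    float d24_err = ldexpf(1.0f, -24) * scale;

    /* 32-bit float, reversed Z: the step is the float ULP at the stored value,
     * which is tiny because distant depths land near 0, where floats are dense. */
    float stored   = n / (f - n) * (f / z - 1.0f);
    float d32f_err = (nextafterf(stored, INFINITY) - stored) * scale;

    printf("at z=%.0f: D24 step ~ %.3f units, D32F reversed-Z step ~ %.6f units\n",
           z, d24_err, d32f_err);
    return 0;
}

With those assumptions the D24 step comes out around 15 world units at that distance, while the reversed-Z float step is well under a thousandth of a unit, which is exactly why distant geometry flickers with the fixed-point format.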
 

SecurityTheatre

Senior member
Aug 14, 2011
672
0
0
You are very wrong about the clock scaling. Even in the 80s "accelerators" for the Apple II had 14+MHz versions on expansion cards. And I've seen one OC to 50MHz in '92. But the instruction set is missing quite a lot (which was to some degree the point..)

I checked with some friends who used to be active on comp.sys.apple2 and poked around on Google. I found that the fastest accelerator was a 14 MHz model that only worked on the Apple IIGS, and the fastest anyone had heard of it running, with active cooling, was about 20 MHz.

The fastest PC chip at the beginning of 1992 was, IIRC, a 486DX at 33 MHz, using a 5-stage pipeline and a pretty fancy (for the time) 1 µm manufacturing process.

Not necessarily calling BS, just pointing out that 50 MHz clocks in 1992 were pretty extreme stuff.

The CPUs in these things were still non-pipelined, multi-cycle execution engines: only one instruction is in flight at a time, and each instruction often takes several cycles to complete.

As such, your 50 MHz accelerator chip would take about 10,000 cycles for an (emulated) FMUL, which is far slower than a 486DX at 33 MHz doing the same operation in 16 cycles. Modern chips that can do an FMUL in 1-3 cycles are extraordinary, and they underscore the disadvantage of oversimplifying the hardware.
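To put rough numbers on that (assumed, order-of-magnitude figures only):

#include <stdio.h>

int main(void)
{
    struct { const char *name; double clock_hz; double cycles_per_fmul; } chips[] = {
        {"50 MHz accelerator (emulated FP)", 50e6, 10000.0},
        {"33 MHz 486DX (hardware FPU)",      33e6,    16.0},
        {"modern ~4 GHz core",                4e9,     2.0},
    };

    for (int i = 0; i < 3; i++)
        printf("%-35s ~%.2e FMUL/s\n", chips[i].name,
               chips[i].clock_hz / chips[i].cycles_per_fmul);
    return 0;
}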



Anyway, that's all. :)
 
Last edited: