Division more costly than multiplication?

aCynic2

Senior member
Apr 28, 2007
710
0
0
I tried looking on Intel's site for machine instruction timings, but I couldn't find anything. I found a doc on opcodes and whether a particular instruction is valid for 64-bit use, but I couldn't find the docs on instruction timings.

Never mind, I think I found something... odd that after 10 years' time, that disparity is still pretty big.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Multiplication is pretty simple to implement, especially in binary. Imagine multiplying 2 by 3, using the grade-school arithmetic technique:

_____010
___* 011
--------
_____010
____010______<- these are called "partial products"
___000
--------
___00110 = 6

Notice that for each position, you're only multiplying by 0 or 1, which is really easy to do - you either get all zeros, or you get the original input back. As a result, each partial product bit can be computed with one AND gate. Once you have all the partial products, you just add them together using a fancy kind of adder. There are some optimizations you can do that speed up the logic, but you can see that this process can be done very quickly.

Division, on the other hand, is just as ugly in binary as it is in decimal. Simple algorithms look like the grade-school "guess and fix if needed" algorithm, while faster algorithms tend to use lookup tables (which take up chip area). See the Wikipedia article on division algorithms. Note that even the fast algorithms are still iterative - while multiplication can be done in one shot (though it's usually broken into multiple cycles because that one shot is slow), division has to iterate.

This document gives cycle counts on Athlon (see Appendix F), and this one is for Athlon 64 / Opteron / etc (see Appendix C).
 

EagleKeeper

Discussion Club Moderator
Elite Member
Staff member
Oct 30, 2000
42,589
5
0
The only division that is efficient is the one that divides by 2.
 

aCynic2

Senior member
Apr 28, 2007
710
0
0
Originally posted by: CTho9305
This document gives cycle counts on Athlon (see Appendix F), and this one is for Athlon 64 / Opteron / etc (see Appendix C).

Even though I'm programming on Wintel and am an Intel fanboy, I think I ended up using that doc.

Ok, so, when possible and when there's more than 1 or 2 divisions, turn the denom into 1/denom and multiply.

It's been a while, but it's coming back.

I'm wondering now if I should still consider a trig lookup table for the rendering and physics.
 

Thyme

Platinum Member
Nov 30, 2000
2,330
0
0
Are you writing this in assembly? If not, do you really think you are going to gain improvements over your compiler?
 

aCynic2

Senior member
Apr 28, 2007
710
0
0
No, I'm using MS C, but yes, I do expect to massage a few extra cycles out of the CPU by unrolling loops, multiplying by the reciprocal rather than dividing by the denominator, early-outing dead-end calculations, overlapping error checking with decision tests, etc.

How do you think game developers get so much speed out of such heavy visuals alongside the physics they utilize?

The peripheral program I'm working on now calculates physical data for complex and concave 3D models as large as the Empire State Building, using a sampling size of 1mm in a span algorithm. Anything I can do to move it along a tad faster, I'll use.
 

Schadenfroh

Elite Member
Mar 8, 2003
38,416
4
0
Well, here is how you do it in assembly (on a 386).

(say the two numbers you are working with are stored in the ebx and the ecx registers and you output to whatever using the eax register)

Multiplication:
mov eax, ebx
imul eax, ecx
;do whatever the heck you want to do with the result here

Division...
mov eax, ebx
cdq ;sign-extend eax into edx:eax for idiv
idiv ecx ;quotient in eax, remainder in edx
;do whatever the heck you want to do with the result here

mov eax, edx
;do whatever the heck you want to do with the remainder here


I assume that it is more costly to the CPU to divide than multiply by looking at the number of steps in the assembly code, although I could be wrong about that (I assume that is the reason that they force us to learn assembly... either that or make us appreciate C and/or C++) :p
 

exdeath

Lifer
Jan 29, 2004
13,679
10
81
^ different instructions have different completion times and start/stop latencies. Also, more complicated instructions can tie up several simple execution units and prevent other instructions from running at the same time.

The rest of the instructions don't even matter so much for this discussion; just comparing idiv to imul is bad enough.

Division is actually relatively cheap on modern CPUs; it used to be several hundred clock cycles.

To Op:

Those tricks aren't really necessary in games any more. Much of the old-school optimization with regard to division had to do with perspective division and clipping, both of which are either implemented in hardware or handled in optimized code in the API already.

You'd be better off learning the graphics API and the implications of certain features in drivers and hardware (using proper lock flags for dynamic vertex buffers, putting vertex buffers in the proper memory location, alignment and padding, batching and minimizing API calls and state changes, vertex transform cache coherency, etc.)

Not to say that optimizing extensive divides in loops that run 1000s of iterations won't save some time, but the kinds of mission-critical, make-or-break stuff you are thinking of from the Doom/Quake days are over for the most part. These days we don't use sin/cos tables or fixed-point math, or avoid division like the plague.

Unrolling loops can also be more harm than good with today's disparity between CPU and memory speed, as it makes inefficient use of the CPU's caches and branch prediction hardware. Rather than unrolling loops, you'd be better off learning about underlying cache operation and knowing concepts such as:

-performing dummy reads on data[j+1] at the start of your loop before you work on data[j] to prime caches concurrently.

-priming loops in C such that you let the CPU know the likely outcome of the majority of your loop's iterations, to purposely leverage accurate branch prediction, etc.

-anything you can do to 'hint' to the CPU what you want, all of which can be done in ordinary C just by rearranging a few lines.

-decoupling code and data so code is re-entrant (the state of the data being operated on is saved in the data, independent of any function invocation), independent of other thread results, and thread-safe, so you can leverage multithreading on modern multi-core CPUs with minimal or no synchronization

-taking advantage of asynchronous hardware operation and decoupling the CPU from other hardware via pipelining to exploit hardware parallelism. Submit frame n before you start working on frame n+1; don't ever wait for anything to finish except during a major synchronization event where you expect everything to be nearly done anyway.

-not mixing integer and floating point math inadvertently, or casting so much that the C runtime ftoi is called, which changes the rounding mode on the FPU and stalls the FPU each time.

Time and profile your code before you waste time 'optimizing'. For any kind of gaming, with a modern fast CPU and graphics card, you'll probably find that most of your time is wasted in inefficient API calls, especially synchronous flushes where you idle the CPU pending completion of the graphics call, and in long complex shaders. Always profile before and after any supposed optimization to see if you've made things better or worse, or whether it's even an area worth spending further time on.
 

aCynic2

Senior member
Apr 28, 2007
710
0
0
Originally posted by: exdeath
Time and profile your code before you waste time 'optimizing'. For any kind of gaming, with a modern fast CPU and graphics card, you'll probably find that most of your time is wasted in inefficient API calls, especially synchronous flushes where you idle the CPU pending completion of the graphics call, and in long complex shaders. Always profile before and after any supposed optimization to see if you've made things better or worse, or whether it's even an area worth spending further time on.

I'm finding I have forgotten much since I stopped programming a decade ago, particularly with regard to efficient program logic. I'm being terribly redundant, to the point that it's not only slowing down the process but screwing up the program altogether.
