^ Different instructions have different completion times and start/stop latencies. A more complicated instruction can also tie up several simple execution units and prevent other instructions from issuing at the same time.
The rest of the instructions don't even matter so much for this discussion; just comparing idiv to imul is bad enough.
Division is actually relatively cheap on modern CPUs, whereas it used to take hundreds of clock cycles.
To the OP:
Those tricks aren't really necessary in games any more. Much of the old-school optimization around division had to do with perspective division and clipping, both of which are now either implemented in hardware or handled by optimized code in the API.
You'd be better off learning the graphics API and the implications of certain features in drivers and hardware (using proper lock flags for dynamic vertex buffers, putting vertex buffers in the proper memory location, alignment and padding, batching and minimizing API calls and state changes, vertex transform cache coherency, etc.)
That's not to say that optimizing extensive divides in loops that run thousands of iterations won't save some time, but the kind of mission-critical, make-or-break stuff you're thinking of from the Doom/Quake days is mostly over. We don't use sin/cos tables or fixed-point math, or avoid division like the plague, these days.
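When a divide really does sit in a hot loop, the cheapest fix is usually hoisting it rather than anything exotic. A minimal sketch (the function names here are mine, not a standard API): one divide outside the loop, multiplies inside.

```c
#include <stddef.h>

/* Naive: one divide per element. */
void scale_div(float *out, const float *in, size_t n, float d)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] / d;
}

/* Hoisted: a single divide computes the reciprocal, then the loop only
   multiplies.  Rounding differs slightly from true division, so this is
   only valid where exact last-bit results don't matter. */
void scale_mul(float *out, const float *in, size_t n, float d)
{
    float inv = 1.0f / d;            /* one divide, outside the loop */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * inv;
}
```

Note that compilers already do this for division by a compile-time constant; the manual version only pays off when the divisor is loop-invariant but not known until runtime.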
Unrolling loops can also do more harm than good with today's disparity between CPU and memory speed, as it makes inefficient use of the CPU's caches and branch prediction hardware. Rather than unrolling loops, you'd be better off learning about underlying cache operation and concepts such as:
-performing dummy reads on data[j+1] at the start of your loop, before you work on data[j], so the next fetch primes the cache concurrently with your computation.
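A minimal sketch of that idea: touch the next element before working on the current one, so the memory fetch overlaps with the computation. A plain dummy read can be optimized away by the compiler, so on GCC/Clang people usually reach for __builtin_prefetch instead; it's shown behind an #ifdef here as a portability hedge.

```c
#include <stddef.h>

long sum_primed(const long *data, size_t n)
{
    long total = 0;
    for (size_t j = 0; j < n; j++) {
#if defined(__GNUC__)
        if (j + 1 < n)
            __builtin_prefetch(&data[j + 1]);  /* prime the next line */
#endif
        total += data[j];                      /* work on current item */
    }
    return total;
}
```

In practice you'd prefetch further ahead than j+1 (several cache lines), since adjacent elements usually share a line already; the one-element lookahead just keeps the example short.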
-structuring loops in C so the CPU can see the likely outcome of the majority of iterations, deliberately leveraging accurate branch prediction, etc.
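One common way to do this in ordinary C is the GCC/Clang __builtin_expect extension (wrapped in macros so it degrades gracefully elsewhere). A hedged sketch, with a made-up example function:

```c
#include <stddef.h>

#if defined(__GNUC__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Count values above a threshold, telling the compiler the hit is rare
   so it lays out the hot path as straight-line fall-through code. */
size_t count_outliers(const int *v, size_t n, int threshold)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(v[i] > threshold))  /* assumed rare */
            hits++;
    }
    return hits;
}
```

Even without the builtin, just arranging the code so the common case falls through and the rare case jumps tends to match what the predictor assumes.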
-anything you can do to 'hint' to the CPU what you want, all of which can be done in ordinary C just by rearranging a few lines.
-decoupling code and data so code is re-entrant (the state of the data being operated on is stored in the data itself, independent of any function invocation), independent of other threads' results, and thread-safe, so you can leverage multithreading on modern multi-core CPUs with minimal or no synchronization
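Concretely, that means no statics or globals holding working state. A minimal sketch (the accumulator here is just an illustration): all mutable state lives in a caller-owned struct, so any number of threads can each run their own instance without locks, as long as no two threads share one struct.

```c
#include <stddef.h>

/* Re-entrant accumulator: state travels with the data, not the code. */
typedef struct {
    double sum;
    size_t count;
} accum_t;

void accum_init(accum_t *a)          { a->sum = 0.0; a->count = 0; }
void accum_add(accum_t *a, double x) { a->sum += x; a->count++; }

double accum_mean(const accum_t *a)
{
    return a->count ? a->sum / (double)a->count : 0.0;
}
```

Had `sum` been a static inside a function, two threads calling it would silently corrupt each other; with the struct, thread safety falls out of ownership rather than synchronization.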
-taking advantage of asynchronous hardware operation: decouple the CPU from other hardware via pipelining to exploit hardware parallelism. Submit frame n before you start working on frame n+1, and don't ever wait for anything to finish except at a major synchronization point where you expect everything to be nearly done anyway.
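The loop structure is the whole trick. In this hypothetical sketch, update_frame() and submit_frame() are stand-ins for your own simulation and draw-submission code; the point is the ordering, not the stubs.

```c
/* Counters stand in for real work so the ordering is observable. */
static int frames_built = 0, frames_submitted = 0;

static void update_frame(int n) { (void)n; frames_built++;     /* CPU work  */ }
static void submit_frame(int n) { (void)n; frames_submitted++; /* kick GPU  */ }

void run_frames(int total)
{
    if (total <= 0)
        return;
    update_frame(0);                  /* build the first frame up front */
    for (int n = 0; n < total; n++) {
        submit_frame(n);              /* GPU starts chewing on frame n... */
        if (n + 1 < total)
            update_frame(n + 1);      /* ...while the CPU builds frame n+1 */
    }
    /* Only now would you block, at a major sync point where everything
       is expected to be nearly done anyway. */
}
```

The anti-pattern is submit-then-wait inside the loop, which serializes CPU and GPU and idles both.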
-not mixing integer and floating-point math inadvertently, or casting so much that the C runtime's ftoi is called, which changes the rounding mode on the FPU and stalls it each time.
Time and profile your code before you waste time 'optimizing'. For any kind of gaming on a modern fast CPU and graphics card, you'll probably find that most of your time goes to inefficient API calls, especially synchronous flushes where the CPU idles pending completion of a graphics call, and to long, complex shaders. Always profile before and after any supposed optimization to see whether you've made things better or worse, and whether it's even an area worth spending further time on.
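Even a crude before/after timer beats guessing. A minimal portable harness (clock() is coarse; on Windows you'd reach for QueryPerformanceCounter, and a real profiler beats both):

```c
#include <time.h>

/* Sink defeats dead-code elimination of the sample workload. */
static volatile double sink;

static void workload(void)
{
    double s = 0.0;
    for (int i = 0; i < 1000; i++)
        s += i * 0.5;
    sink = s;
}

/* Run fn() reps times and return elapsed CPU seconds. */
double time_it(void (*fn)(void), int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Time the old version, time the new one under the same reps, and keep whichever wins; if the difference is inside the noise, the "optimization" wasn't one.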