I've just had a frustrating couple of hours trying to diagnose this in a hobby program I've been putting together.
Essentially, the key routine is a large Galois field multiply. Under debug, it worked perfectly, tested over millions of test vectors.
When compiled with optimization on, it would fail on a proportion of test vectors (about 0.1%).
I eventually worked out that this was due to integer overflow during the multiplication. But why the difference?
The multiply was performed in typical school-book long-multiplication fashion (with some tweaking to take account of the fact that different coefficients had different radices).
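To make the shape of the computation concrete, here is a minimal sketch of that kind of column-wise schoolbook multiply (not my actual code; `NCOEFF`, `coeff_t` and the sizes are hypothetical, and the per-coefficient radix correction is omitted):

```c
#include <stdint.h>

#define NCOEFF 20

typedef int32_t coeff_t;   /* coefficients assumed small, e.g. at most 13 bits */

/* Each output "column" k accumulates all cross products a[i]*b[j] with i + j == k. */
void schoolbook_mul(const coeff_t a[NCOEFF], const coeff_t b[NCOEFF],
                    coeff_t out[2 * NCOEFF - 1])
{
    for (int k = 0; k < 2 * NCOEFF - 1; ++k)
        out[k] = 0;

    for (int i = 0; i < NCOEFF; ++i)
        for (int j = 0; j < NCOEFF; ++j)
            out[i + j] += a[i] * b[j];   /* a middle column gathers up to NCOEFF products */
}
```

The point that matters for what follows is that every column is a long running sum of signed intermediate products.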
Through luck rather than judgement, the order in which my original code summed the coefficients tended to alternate in sign, so a positive product would then be summed with a negative product, and so on.
What I found with the optimized version was that various intermediate products were cached and reused. This changed the order of summing in certain places, and would occasionally cause an integer overflow.
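A toy example (not the original routine) shows why the order matters even when the final answer fits: with 32-bit signed ints, the mathematical result can be in range while one particular grouping of the partial sums is not.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t p1 = 1500000000, n1 = -1400000000;
    int32_t p2 = 1500000000, n2 = -1400000000;

    /* Alternating signs: every partial sum stays in int32_t range.
     * Partial sums: 100000000, 1600000000, 200000000. */
    int32_t ok = ((p1 + n1) + p2) + n2;

    /* Summing the positives first would form p1 + p2 = 3000000000, which
     * overflows a 32-bit signed int (undefined behaviour in C), even though
     * the final value is the same 200000000. */
    printf("%d\n", ok);
    return 0;
}
```

Caching and reusing intermediate products effectively regrouped my sums in just this way.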
I was using 32-bit signed ints, with 13-bit (max) coefficients. I thought that a 26-bit product would fit safely into a 32-bit int. But I had forgotten something - I needed an extra bit (for the radix correction) - OK, that's 27 bits - there are still 4 usable bits left.
The other thing I had forgotten was that with 20 coefficients, some of the columns required summing more than 16 intermediate products. In effect, these columns needed another 5 bits. And it was indeed exactly here that the overflow occurred when the intermediate products were summed in an order other than the one I had accidentally arranged.
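A back-of-envelope check of that headroom (using the figures quoted above; 64-bit arithmetic here just so the worst case can be printed without overflowing):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 13-bit coefficients give a 26-bit product; one more bit for the
     * radix correction brings each term's magnitude up to 2^27. */
    int64_t term = INT64_C(1) << 27;

    /* A column can gather up to 20 such terms. */
    int64_t worst_column = 20 * term;

    printf("worst-case column sum: %lld\n", (long long)worst_column);  /* 2684354560 */
    printf("INT32_MAX:             %d\n", INT32_MAX);                   /* 2147483647 */
    return 0;
}
```

20 * 2^27 is roughly 2^31.3, which no longer fits in a 32-bit signed int.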
I fixed it by breaking the summing process in the vulnerable "columns" into two halves, and performing a carry step partway through the summing.
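A minimal sketch of that kind of fix, assuming a hypothetical per-column radix `RADIX` and glossing over the sign/rounding details of the real carry step:

```c
#include <stdint.h>

#define RADIX (1 << 13)   /* hypothetical per-column radix */

/* Sum n intermediate products into column k of acc, folding the excess
 * into column k+1 halfway through so no partial sum outgrows 32 bits.
 * The caller must guarantee that column k+1 exists. */
static void sum_column_with_midway_carry(int32_t *acc, int k,
                                         const int32_t *products, int n)
{
    int32_t sum = 0;
    int half = n / 2;

    for (int i = 0; i < half; ++i)
        sum += products[i];

    /* Carry step partway through: move the high part into the next column. */
    acc[k + 1] += sum / RADIX;
    sum %= RADIX;

    for (int i = half; i < n; ++i)
        sum += products[i];

    acc[k] += sum;
}
```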