I more or less agree with this notion too. If a legitimate compiler optimization breaks a benchmark, that doesn't necessarily make the optimization wrong; it makes the benchmark bad. But if the optimization does nothing except break that benchmark, then the optimization is dishonest.
As far as I'm concerned, you can't break a non-synthetic benchmark, and you generally can't break a good synthetic benchmark either. nbench is quite bad (some parts worse than others), and it's also very old. Even if the authors realized this part could be broken like this (they should have, but may not have), they may have assumed no compiler would bother, since compilers were a fair bit more primitive back then.
110% agree, especially with the bolded part.
I remember well when SUN broke the SPEC benchmark with a compiler optimization that improved the score in just one test of the suite by something like 10x (or was it even more?).
So they scored the same in 14 out of 15 tests, but in that one test their score went from something like 8.7 to an astonishing 93 (IIRC), which pulled the composite score up enough that it suddenly looked like they were kicking POWER6's ass in SPEC.
It completely broke the purpose of the benchmark, and the compiler trick wasn't applicable to real-world software.
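To see how much one inflated subtest can move the overall number: SPEC composite scores are geometric means of the subtest ratios, so a single outlier lifts the whole composite. A quick sketch using the (half-remembered, hypothetical) figures from the anecdote above:

```python
import math

def geomean(xs):
    # SPEC composites are geometric means of the per-test scores.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical scores: 15 identical subtests vs. 14 identical plus
# one gamed by the compiler trick (8.7 -> 93, per the anecdote).
baseline = [8.7] * 15
gamed = [8.7] * 14 + [93.0]

print(geomean(baseline))  # 8.7
print(geomean(gamed))     # ~10.2: one outlier lifts the composite ~17%
```

So even with the geometric mean's relative resistance to outliers, gaming one test out of fifteen by ~10x inflates the headline number by roughly a sixth.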
Everyone knew it, and the subtest results were there for all to see... but that didn't stop SUN marketing from hyping and grandstanding about their uber processor and its SPEC scores.
(And yes, the irony never escaped me: while we were SUN's foundry, I knew their chips were crappy and delivered bottom-tier performance, the Atoms of the big-iron world.)