@Doug S those are good points that I agree with in general, but we have to draw a line somewhere, as there are benchmarks from all camps:
1) We have benchmarks where source code is available, for example SPEC, which follows what you wrote to the letter. They even allow overriding the malloc library used for allocations with some custom heap library. That is when the compiler and vendor games really start: results between two vendors, or even two systems, are not really comparable, and who knows what they mean? Vendors have to disclose setup and flags, but all sorts of crazy compiler and malloc tuning has happened in the past.
Even when testing is done by one house, like AnandTech used to do, it still leaves a lot of questions, like "should we use the latest compiler?" or "what if vendor A's compiler produces code that in fact scores better on vendor B's system, but no one will use it in the real world?"
2) Black box benchmarks that focus on some narrow part of the performance spectrum; the usual suspects are well known. At some point they face the same conundrum of benchmark evolution and "optimization" as (1): say some sort of hardware video encoding acceleration is now available, should it be represented in their benchmark? What if the whole field has changed and no one is using the CPU to render images anymore? What about Apple or AMX acceleration if they test some CPU AI/DL workload? Lots of questions here.
3) The effort from the company that produces the GB5 suite, which is closed source with binaries provided, but whose benchmarks seem to get better over the years. They are not really calling any libraries, and they target the CPU in a way that I believe matters on mobile and desktop. I have already posted above that I like their choice of benchmarks: for example, gone are the AES tests, gone is FFT. Heck, even things like "JPEG encode speed" are no longer tested -> mobile phones have dedicated hardware, and on desktop anyone who needs a mass JPEG processing pipeline for professional work needs to get their head checked if they are not using specialized libs and/or CUDA. In its place is a workload that uses DL to classify images and deal with tags -> a much wider task than just encoding/decoding JPEG.
See for yourself what they test; it makes a lot of sense to me.