I'm not downplaying anyone's engineering. My point is that engineers designing an ARM core can, for example, simply make the load-store unit bigger and faster, while x86 engineers are handicapped by legacy baggage like total store order (TSO). Instead of just widening the design, they have to keep adding complex hardware that hides the core's reordering so that legacy guarantees like TSO still hold.

TSO basically means that other cores see a core's writes in program order, so multithreaded code (often poorly written and designed) gets away without extra synchronization instructions for store ordering. The cost is that the CPU's store pipeline can't reorder the stores it commits to cache. Say there are 50 stores waiting to go to cache and all the cache lines they need are in L1 except one: the core can't commit any of the stores behind the one that missed until the missing line has been loaded. With a weakly ordered memory model the other stores can drain out of order, so slots free up and execution can go on as long as there are free store buffer slots. Also, under TSO the core can't merge a new store into a store buffer slot that already holds data for the same address if some other store sits between them. A weakly ordered core can, which means less cache traffic and a lower chance of stalling execution.
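To make the "sync instructions" part concrete, here is a minimal C++ sketch of the classic message-passing pattern. All names (`payload`, `ready`, `run_message_passing`) are made up for illustration. The release store and acquire load are what portable code has to ask for explicitly. On x86 they typically compile to plain loads and stores because TSO already orders stores, while on a weakly ordered core the compiler has to emit ordered instructions (typically store-release and load-acquire on AArch64) for those two accesses only, leaving every other store free to be reordered.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                 // plain (non-atomic) data
std::atomic<bool> ready{false};  // flag that publishes the data

void producer() {
    payload = 42;  // store 1: the data
    // store 2: release ordering means store 1 is visible to any thread
    // that observes ready == true with an acquire load.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Spin until the flag is set; the acquire load pairs with the release store.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // ordered after the flag read, so this reads 42
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

If both accesses to `ready` were changed to `std::memory_order_relaxed`, the program would have a data race on `payload` by the C++ rules. On x86 it would most likely still appear to work because of TSO, which is exactly the kind of code that quietly depends on the legacy ordering.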