Guys, I hate to throw a wet blanket on this (I'm sort of amused that we have people messing around with GPUView voluntarily in their spare time), but I think it's time to ask what the point of this exercise is. As a few people have argued on the previous page, none of the results from synthetic workloads are going to generalize to "real" workloads, because by its very nature this is all completely workload- and architecture-dependent. Drilling into the specifics of the implementation on one piece of hardware (reverse engineering) is interesting from a curiosity point of view,
but don't be under the illusion that these tests will be predictive of real workload performance, or even that one "real" async compute workload will be predictive of another. It's roughly like claiming that one architecture is "good at compute" or making some similarly general, meaningless statement.
Of course, I don't realistically expect people to stop digging (it is sort of fun, after all), but keep the limitations of this data in perspective, and remember that fanboys all across the internet like to grab bits of data out of context from here to support whatever preconceived notions or brand loyalties they have. Let's not be those folks here, at least.