More testing. This time I had all 4 threads referencing the same tree instance, which led to some interesting results:
I tried to force a less-than-optimal scenario by setting affinity to 0,2,8,10 and referencing the same instance on all threads. Having one copy means that it can only reside in the cache of one CCX at a time, so with affinity set to run threads on both CCXs we are bound to see increased latencies. Indeed, quite frequently all threads have their latencies at around 600-700ns. There is much more variation between runs compared to the optimal scenario with two copies and affinity, where the latencies sit at around 350ns on all threads. Though there is more variation with a single copy, there is also consistency between the threads, and their latencies tend to be within 20ns of one another.
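For anyone wanting to reproduce the affinity settings: many tools (for example `start /affinity` on Windows) expect a hexadecimal bitmask rather than a list of logical CPU numbers. A minimal sketch of that conversion, assuming bit N of the mask corresponds to logical CPU N; the helper name `affinity_mask` is made up for illustration and is not part of my test program:

```python
def affinity_mask(cpus):
    """Return a bitmask with one bit set per logical CPU index (bit N = CPU N)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# The cross-CCX affinity used above: logical CPUs 0, 2, 8 and 10.
print(hex(affinity_mask([0, 2, 8, 10])))  # 0x505
```

The mask is just the sum of the individual bits (1 + 4 + 256 + 1024 = 1285 = 0x505), so any other affinity list converts the same way.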
So what about all 4 referencing the same instance without affinity? Interestingly enough, the results here are much the same as with 0,2,8,10 affinity: similar latencies of 600-700ns and similar consistency between threads.
OK, what about all 4 referencing the same instance using one CCX with 0,2,4,6 affinity? Well, just as you might expect, results are back to around 350ns on all threads, given that we again have an optimal scenario with the data residing in the cache of the same CCX the threads execute on.
The last one I tested this time around was an affinity of 0,1,2,3, meaning we stick to one CCX but only two physical cores. Once again the results are consistent, with all threads at around 450ns latency. That is only some 100ns more than with 4 physical cores. SMT seems to be doing what it was built for here: switching between threads at blazing speeds.
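To make the affinity sets easier to compare, here is a rough sketch that maps logical CPU numbers to physical cores and CCXs. It assumes the numbering implied by the tests above: SMT siblings are adjacent (logical CPUs 0 and 1 share a core) and each CCX holds 4 physical cores. The constants and the `placement` helper are illustrative assumptions, and other operating systems may enumerate logical CPUs differently:

```python
THREADS_PER_CORE = 2  # assumed: logical CPUs 0 and 1 are SMT siblings on core 0
CORES_PER_CCX = 4     # assumed: 4 physical cores per CCX


def placement(cpus):
    """Return (physical cores, CCXs) touched by a list of logical CPU numbers."""
    cores = sorted({cpu // THREADS_PER_CORE for cpu in cpus})
    ccxs = sorted({core // CORES_PER_CCX for core in cores})
    return cores, ccxs


# The three affinity sets tested in this post.
for cpus in ([0, 2, 8, 10], [0, 2, 4, 6], [0, 1, 2, 3]):
    cores, ccxs = placement(cpus)
    print(cpus, "-> cores", cores, "CCX", ccxs)
```

Under those assumptions, 0,2,8,10 lands on four cores split across both CCXs, 0,2,4,6 on four cores in one CCX, and 0,1,2,3 on two cores in one CCX, which lines up with the three latency groups above.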