In the interest of benchmarking (specifically, I'm considering using Linux to test some things), I happened upon Funtoo Linux, a variant of Gentoo that has profiles for nearly every processor, so one can build the system to extract maximum performance (maximum within the limitations of the distro itself), at the cost of more effort to get everything up and running.
Here's what I'm wondering, though. The page below says AVX performance is a problem, yet AMD's compiler optimization profile enables AVX. That's the same profile Funtoo uses: -march=bdver2, along with -O2 -pipe. An answer on the page also points to a workaround from Agner Fog and seems to suggest that XOP might be worth avoiding too. I'm not sure why that would be. Does XOP also have performance problems?
http://stackoverflow.com/questions/33460592/forcing-avx-intrinsics-to-use-sse-instructions-instead#
"Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes."
"there is a solution for this. Agner Fog's vector class. Use a AVX vector such as Vec8f and compile with -D__SSE4_2__ -D__XOP__."
...
"If you don't want to use XOP don't use -D__XOP__."
etc.
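For what it's worth, here is a rough sketch of what I understand the idea to be: keep doing the math in 256-bit AVX registers, but write the result to memory as two 128-bit halves instead of one 256-bit store. This is not Agner Fog's vector class, just plain intrinsics, and scale_floats is a name I made up for illustration. If the file is compiled without AVX enabled, only the scalar loop is used.

```c
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical helper: dst[i] = src[i] * k for n floats. */
void scale_floats(float *dst, const float *src, float k, size_t n)
{
    size_t i = 0;
#ifdef __AVX__
    const __m256 vk = _mm256_set1_ps(k);
    for (; i + 8 <= n; i += 8) {
        /* Load and multiply eight floats in one 256-bit register. */
        __m256 v = _mm256_mul_ps(_mm256_loadu_ps(src + i), vk);
        /* Write the result as two 128-bit stores, not one 256-bit store. */
        _mm_storeu_ps(dst + i,     _mm256_castps256_ps128(v));   /* low half  */
        _mm_storeu_ps(dst + i + 4, _mm256_extractf128_ps(v, 1)); /* high half */
    }
#endif
    /* Scalar tail; also the whole loop when AVX is not enabled. */
    for (; i < n; i++)
        dst[i] = src[i] * k;
}
```

Whether this actually helps on a given Piledriver chip is exactly the kind of thing I'd want to benchmark, so treat it as a starting point and not a fix.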
What I'm mainly wondering is whether the bdver2 profile should be modified for better performance (under Funtoo, for instance) and, if so, how.
Also, is it more useful to compile just individual programs with something like bdver2 rather than everything, making it possible to skip something complex like Gentoo in favor of an easier-to-use and possibly faster distro? Benchmarks I've seen on Phoronix show big swings in performance from distro to distro, depending on the test. It does seem, though, that Intel's distro benefits quite a bit in CPU tests from the instruction optimization Intel has done.
Of course, the other question is... has anything changed since this article in 2012, in terms of the benefits to be had:
Phoronix said:
"With the Piledriver support came work within AMD's Open64 compiler fork for handling AVX, XOP, FMA3, FMA4, BMI, TBM, and F16C instruction sets."
"I went over what the bdver2 target adds: BMI, TBM, F16C, and FMA3."
"For all of the tests carried out under the latest AMD Open64 compiler release for Linux, none of these common open-source Linux benchmarks benefited from being built under "-march=bdver2" for the latest Piledriver support (BMI/TBM/F16C/FMA3) compared to just targeting the first-generation Bulldozer processors. Once software is better able to take advantage of BMI/TBM/F16C/FMA3, we will hopefully see the FX-8350 become even more competitive."
bdver1 was faster or equivalent in every test, maybe because of FMA4? From what I've read, though, Open64 isn't considered a fast compiler.
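To see which of those extensions a given set of flags actually turns on for a particular build, one option is to check the macros the compiler predefines. Below is a small sketch that assumes GCC-style macro names (__AVX__, __XOP__, __FMA4__, __FMA__, __BMI__, __TBM__, __F16C__); the function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Append "name " to buf if it still fits (name + space + terminating NUL). */
void append_name(char *buf, size_t len, const char *name)
{
    size_t used = strlen(buf);
    if (used + strlen(name) + 2 <= len) {
        strcat(buf, name);
        strcat(buf, " ");
    }
}

/* Fill buf with the ISA extensions enabled at compile time and return
 * how many of the checked macros were defined. */
int list_isa_macros(char *buf, size_t len)
{
    int count = 0;
    if (len == 0)
        return 0;
    buf[0] = '\0';
#ifdef __AVX__
    append_name(buf, len, "AVX");  count++;
#endif
#ifdef __XOP__
    append_name(buf, len, "XOP");  count++;
#endif
#ifdef __FMA4__
    append_name(buf, len, "FMA4"); count++;
#endif
#ifdef __FMA__
    append_name(buf, len, "FMA");  count++;   /* FMA3 */
#endif
#ifdef __BMI__
    append_name(buf, len, "BMI");  count++;
#endif
#ifdef __TBM__
    append_name(buf, len, "TBM");  count++;
#endif
#ifdef __F16C__
    append_name(buf, len, "F16C"); count++;
#endif
    return count;
}
```

Calling list_isa_macros with a 64-byte buffer and printing the result from builds made with -march=bdver1 and -march=bdver2 should show what each target enables. It only reports what the compiler is allowed to use, not whether any of those instructions ended up in the generated code.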
I know Piledriver is old news, but I'm curious anyway. I'd rather not spend a huge amount of time and effort on benchmarking, but I'd like to do more than just run generic compiles in Ubuntu if there is a benefit to be had from a bit more effort. However, if the Gentoo/Funtoo distro itself is going to be much slower than something like Ubuntu, it seems like it would make more sense to compile just the apps themselves with processor-specific optimization rather than the whole OS.