Question on Xeon Processor

wenboy

Junior Member
Oct 20, 2012
Hi all,
Recently, when I ran a Matlab program (Matlab R2011a), I observed a noticeable difference between an HP dv6z A8-3550MX (2.0/2.7 GHz) running Win8 and a DIY A8-3850 (2.9 GHz) running Win7. For curiosity's sake, I decided to run a little Matlab bench of my own. The supposedly fastest machine I have access to is a 3-year-old, company-provided build machine on Windows 2008 (HP Z400 workstation, Xeon W3550 3.06 GHz). When I used that as a reference, I found something even more interesting.

The Matlab bench program I used is a script (not a function, so it is compiled every time it runs). The bench contains 10,000 discrete Biot-Savart operations and prints 4 debug messages to the command window with fprintf. Each iteration runs this bench 10 times (compiling the script 10 times and running the functions 10 times), and there are 10 iterations in total.
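The iteration scheme described above can be sketched with tic/toc. This is only an illustration of the timing loop, not the exact harness used: `benchFn` is a hypothetical stand-in workload, and the names `nIter`, `nRuns`, and `t` are made up for this sketch. To time the real bench, the stand-in would be replaced by a call that runs the actual script.

```matlab
% Hypothetical stand-in workload (assumption); swap in the real bench script here.
benchFn = @() sum(sqrt(1:1e5));

nIter = 10;              % number of reported iterations
nRuns = 10;              % bench runs inside each iteration
t = zeros(1, nIter);     % elapsed seconds per iteration

for k = 1:nIter
    tStart = tic;        % start a timer for this iteration
    for r = 1:nRuns
        benchFn();       % one run of the bench
    end
    t(k) = toc(tStart);  % seconds for nRuns runs
end

fprintf('%.4f ', t);
fprintf('\n');
fprintf('mean %.4f s, spread %.4f s\n', mean(t), max(t) - min(t));
```

Printing the spread next to the mean makes it easier to judge whether a gap of a few tenths of a second between two machines is larger than the run-to-run noise.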

I used the second set of 10 iterations (so everything is cached nicely) and recorded the following times, in seconds per iteration:

Xeon W3550 recorded:
14.6206 14.5981 14.6176 14.6121 14.6090 14.6179 14.5892 14.6454 14.6040 14.6094

My own A8-3850 recorded:
14.4000 14.3661 14.3324 14.3389 14.3378 14.3643 14.3551 14.3484 14.3336 14.3708

(I cannot comment on Win8 bench due to NDA.)

On paper, the Xeon should be much faster than the newer Llano, so this is not what I expected. Can somebody explain why this is the case?


The following is the base script I use.

% Debug message 1
fprintf('fprintf half circle 0.003125005086285 to pi - 0.003125005086285 \n');
R = 0.008;
% Half circle of radius R, sampled at an angular step of 1e-4 rad
x = R * cos([0.003125005086285: 1*10^-4: pi - 0.003125005086285]);
y = (R^2 - x.^2).^(0.5);
z = zeros(1, length(x));
% Debug message 2
fprintf('fprintf function 2, half circle from x = %d to x = %d, with dx = %d \n', x(1), x(end), x(2) - x(1));
%fprintf('Step a): half circle from x = %d to x = %d, with dx = %d \n', x(1), x(end), x(2) - x(1))

wirelist = [x; y; z];
% NOTE: D is not defined anywhere in the posted script; it must be set before this line.
remotepoint = [0; D; 0];
I = 3;
mu_r = 100;

% Debug message 3
fprintf('fprintf function 3, calculate Bz \n');
%fprintf('Step b): calculate Bz \n')

Bz = DiscretePlanarBiotSavart(wirelist, remotepoint, I, mu_r);
% Debug message 4
fprintf('fprintf function 4, Bz %d \n', Bz);
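The body of DiscretePlanarBiotSavart is not posted, so what it actually does is unknown. For anyone wanting to reproduce a similar workload, here is a minimal sketch of what a discrete Biot-Savart sum for a planar wire could look like. The function name `SketchPlanarBiotSavart` and the midpoint discretization are assumptions of this sketch, not the original code; only the argument list mirrors the call in the script.

```matlab
function Bz = SketchPlanarBiotSavart(wirelist, remotepoint, I, mu_r)
% Hypothetical stand-in (assumption), NOT the original DiscretePlanarBiotSavart.
% wirelist    : 3-by-N wire points, remotepoint : 3-by-1 field point
% I           : current in amperes,  mu_r       : relative permeability
mu0 = 4*pi*1e-7;                                       % vacuum permeability, H/m

dl  = diff(wirelist, 1, 2);                            % 3-by-(N-1) segment vectors
mid = (wirelist(:, 1:end-1) + wirelist(:, 2:end)) / 2; % segment midpoints
r   = repmat(remotepoint, 1, size(mid, 2)) - mid;      % segment -> field point

rmag3  = sum(r.^2, 1).^1.5;                            % |r|^3 for each segment
crossZ = dl(1,:).*r(2,:) - dl(2,:).*r(1,:);            % z component of dl x r

Bz = mu0 * mu_r * I / (4*pi) * sum(crossZ ./ rmag3);   % summed z field
end
```

As a sanity check for a sketch like this, the field at the center of a half circle of radius R should approach mu0*mu_r*I/(4*R) as the step size shrinks. The sum is fully vectorized; if the original function is a scalar loop instead, its timing profile could look quite different.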
 

MarkLuvsCS

Senior member
Jun 13, 2004
I am curious too about what is causing this. Comparing a 3.06 GHz i7-950 (the closest match to the Xeon variant) against a 3.8 GHz A10-5800K, AMD seems to lose in every CPU benchmark despite a healthy frequency advantage. This leads me to believe the bottleneck isn't the CPU at all. At least you know, as a bonus, that you can do some of your work from your laptop and not worry about speed vs. your workstation for this particular load. Perhaps testing with a higher number of iterations, to see how the results change, can shed more light on the limiting element in that particular workload.
 

wenboy

Junior Member
Oct 20, 2012
Thanks for the reply. Originally, I thought it would be because the Xeon doesn't have a decent FPU. But that shouldn't be the case, because if I delete all the fprintf calls (not my real use case, because I still want those debug messages here and there), there is a noticeable improvement of approximately 0.40 seconds per 10 runs, putting it on par (0.1-ish slower) with the A8-3850 run without fprintf.
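One way to see how much of that time is console output rather than number formatting is to time fprintf against sprintf on the same messages. This is a rough sketch only: the message text and the count `nMsgs` are made up for illustration, and the result will depend on the machine and on the state of the command window.

```matlab
nMsgs = 400;    % assumed message count, chosen only for illustration

% Format each message and write it to the Command Window.
tStart = tic;
for k = 1:nMsgs
    fprintf('debug message %d \n', k);
end
tPrint = toc(tStart);

% Format the same messages without writing them anywhere.
tStart = tic;
for k = 1:nMsgs
    s = sprintf('debug message %d \n', k);
end
tFormat = toc(tStart);

fprintf('with output: %.4f s, format only: %.4f s\n', tPrint, tFormat);
```

If the first number is much larger than the second, the cost sits in the command window output path, which is outside the floating-point work.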

But even ignoring the fprintf lag on the Xeon under Windows Server 2008, that does not explain the result. My original intent was to compare Win8 vs. Win7, but Windows Server 2008 should be much more efficient than Win7 (it doesn't have those unneeded services running in the background).

I believe Matlab R2011a is not optimized for AMD APUs (they didn't exist when that version of Matlab was released). And yes, the good news is that I never noticed any difference between my Llano desktop/laptop and this 3-year-old but still 1000-dollar workstation. But the bad news is that the IT guys don't know what they are doing ... :(
 

Cerb

Elite Member
Aug 26, 2000
Does it compile it to machine code? If it is just compiling to byte code, the interpreter overhead could be taking most of the time. My understanding is that Matlab, without doing anything special, is an interpreted runtime environment.

The differences might indeed not be night and day between them, depending on what you do, but nearly identical scores lead me to believe that both CPUs are being bottlenecked by something else. Does Matlab have a profiling tool?
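For reference, Matlab does ship a profiler, driven by the `profile` command. A minimal sketch of using it follows. `benchFn` is a hypothetical stand-in workload (an assumption of this sketch), and the top-5 listing is just one way to read the results; `profile viewer` opens the graphical report instead.

```matlab
% Hypothetical stand-in workload (assumption); swap in the real bench here.
benchFn = @() sum(sqrt(1:1e5));

profile on                   % start collecting timing data
for r = 1:10
    benchFn();
end
profile off                  % stop collecting

p = profile('info');         % struct holding the collected statistics
totals = [p.FunctionTable.TotalTime];
[~, idx] = sort(totals, 'descend');

% List up to five entries with the largest total time.
for i = idx(1:min(5, numel(idx)))
    fprintf('%-40s %8.4f s\n', ...
        p.FunctionTable(i).FunctionName, p.FunctionTable(i).TotalTime);
end
```

The report is per function and per line, not per machine instruction, so it can show whether the time goes to fprintf, to script overhead, or to the numeric kernel, but not which CPU instructions are used.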

wenboy said: "I believe Matlab R2011a is not optimized for AMD APUs (they didn't exist when that version of Matlab was released). And yes, the good news is that I never noticed any difference between my Llano desktop/laptop and this 3-year-old but still 1000-dollar workstation. But the bad news is that the IT guys don't know what they are doing"
Nah. You get a workstation to have server hardware on your desktop. It's usually more about having a little better RAS, than it is speed.
 

videogames101

Diamond Member
Aug 24, 2005
According to the OP, removing them helps, but not that much (that was my first thought, too).

didn't read past the OP lol,

Try running the same code under C/C++ and see if Cerb's suspicion about interpreter overhead is correct (although that could take some time to convert).
 

Cerb

Elite Member
Aug 26, 2000
If there's a library that can do the "DiscretePlanarBiotSavart" function, Java or C#/VB.NET/F# would work, too, and be easier. Or, if Matlab has a profiler (again, I don't do Matlab; I'm going by hearsay and quick Google research), the OP might be able to find out what opcodes are eating the most time, and what features are in use. For instance, if the "DiscretePlanarBiotSavart" function is a scalar kernel, with a mix of x87 and SSE(2), and has a small enough cache footprint, then Llano might just, coincidentally, really be just as fast in this case: the differences between the cores and the clock speeds they are running at may just happen to make them perform closely.

Relevant link
 

wenboy

Junior Member
Oct 20, 2012
Thanks Cerb and 101,

I will try both C++ and C# to see what the difference is. I have already done everything that Matlab suggested to improve the performance of my functions in the original bench.

From what I know, yes, Matlab interprets (compiles) a function once at runtime, for the lifetime of the function, and interprets (compiles) a script on the fly. That interpretation itself is a multithreaded process. But running? I think running uses just one core, and that is where the FPU comes in.