Comparing Data to Model with Norm

Status
Not open for further replies.

CycloWizard

Lifer
Sep 10, 2001
12,348
1
81
Update: After reading up on R^2 and realizing that it's not really useful for this application, I've been thinking about this a little more. I think the correct approach might be to use a norm (e.g. the L2 norm) to compute the error. I can then compute the relative error in the norm to give a pretty solid comparison of the data to the model. I updated the thread title accordingly.

=========================================
I have a relatively straightforward program that solves some model equations y=f(x). I want to see how well this nonlinear model correlates with data z=g(x), eventually arriving at an R^2 value. I have found two ways to do this in MATLAB, and they give amazingly disparate results.

The first is using MATLAB's built-in corrcoef function, which computes the normalized correlation coefficient R:
R=corrcoef(model,data);%corrcoef returns a 2x2 correlation matrix
Rsq=R(1,2).^2%square the off-diagonal element


The second is calculating R^2 using the formula that I am familiar with (i.e. R^2=1-SS_err/SS_tot):

SS_err=sum((Data-Model).^2);
SS_tot=sum((Data-mean(Data)).^2);
Rsq=1-SS_err./SS_tot;
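For concreteness, here are the same two computations in NumPy on a made-up model/data pair (the curve and noise level are purely illustrative; np.corrcoef plays the role of MATLAB's corrcoef):

```python
import numpy as np

# Illustrative model/data pair (not the actual curves from this thread).
x = np.linspace(0.0, 1.0, 101)
model = 1.0 / (1.0 + x) ** 2
rng = np.random.default_rng(0)
data = model + 0.02 * rng.standard_normal(x.size)  # model plus small noise

# Method 1: squared correlation coefficient.
# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is R.
rsq_corr = np.corrcoef(model, data)[0, 1] ** 2

# Method 2: coefficient of determination, R^2 = 1 - SS_err/SS_tot.
ss_err = np.sum((data - model) ** 2)
ss_tot = np.sum((data - data.mean()) ** 2)
rsq_det = 1.0 - ss_err / ss_tot
```

When the model tracks the data up to small noise, the two numbers roughly agree; they diverge once the model is offset or rescaled relative to the data.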

I have four sets of data and three sets of models, so Rsq ends up being a 3x4 matrix. Method #1 gives

0.9408 0.9846 0.9370 0.9408
0.9381 0.9840 0.9383 0.9381
0.9431 0.9908 0.8999 0.9431

Method #2 gives

0.5972 -0.1735 0.8220 0.4431
0.1777 0.2551 0.7937 0.8672
-2.9059 0.9313 -3.5218 0.7585

Obviously, method #1 makes my model look amazing, but I don't think it's *that* good, and I'm not familiar with computing R^2 this way, so I'm not even sure it's appropriate. Is anyone more familiar with this who can tell me which way is more appropriate, or whether a completely different method might be better?

edit: Looking at the results a little closer, it seems that Method #1 results are higher when the shape of the model curve is more similar to the data, but doesn't account for offsets/differences in amplitude of the changes. Method #2 appears to be very sensitive to offsets/differences in amplitude. It's almost like #1 is giving a qualitative comparison of the trends, while #2 is giving a comparison of the quantitative results, though maybe I'm reading too much into it.
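That split can be demonstrated directly: the correlation coefficient is unchanged by any constant offset or positive rescaling of one signal, while the SS-based R^2 penalizes both. A NumPy illustration with made-up curves (not the actual models):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 101)
data = np.exp(-3.0 * x)               # made-up "data"
model = 0.3 * np.exp(-3.0 * x) + 0.5  # same shape, wrong amplitude and offset

# Method 1 sees only the shared shape: model is an affine function of data,
# so the correlation is exactly 1 despite the visible mismatch.
rsq_corr = np.corrcoef(model, data)[0, 1] ** 2

# Method 2 compares actual values, so the amplitude/offset mismatch drives
# SS_err above SS_tot and R^2 goes negative.
ss_err = np.sum((data - model) ** 2)
ss_tot = np.sum((data - data.mean()) ** 2)
rsq_det = 1.0 - ss_err / ss_tot
```

In other words, method #1 grades the trend while method #2 grades the numbers, which matches the pattern in the two matrices above.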
 

Farmer

Diamond Member
Dec 23, 2003
3,334
2
81
Finally, someone who asks a question that is not about grad-level particle physics. And yet, I still don't know the answer.

I'm quite sure both results are quantitative, as there is nothing in MATLAB that is qualitative. I have never done statistics in MATLAB (though I'd imagine it's quite powerful), nor do I have much experience with nonlinear least-squares curve fitting, but perhaps your expression for R^2 only applies to linear (LLSR) models? That may explain the huge discrepancy. I've never read a stats textbook beyond the high school level, so please don't ask me to elaborate; I'd just link you to a Wiki page.

Perhaps this help file on the "corrcoef" function will help?

http://www.mathworks.com/acces...hdoc/ref/corrcoef.html

There is an expression there that would probably make more sense to you than to me.
 

CycloWizard

Lifer
Sep 10, 2001
12,348
1
81
I had the same thought: perhaps my definition of R^2 only holds for linear models. I threw together a quick program that calculates R^2 using both methods and tested it on linear and nonlinear models. The corrcoef method does not give R^2=1 for nonlinear models even when the model exactly describes the data, whereas the other method does. The corrcoef method is also completely insensitive to noise in the data (i.e. it returns the exact same result regardless of how much noise I add), whereas the other method is highly noise-dependent. Code is below:

NoiseFactor=0.01;%amplitude of noise
x=0:0.01:1;%independent variable
y=x.^2;%"model"
y_hat=y+(0.5-rand(size(y))).*NoiseFactor;%"data"
plot(x,y_hat,'kx',x,y,'r-');
[R,P]=corrcoef(x',y');%correlate the independent variable x with the model y
Rsq_corrcoef=R(1,2).^2

SSR=sum((y-y_hat).^2);%sum of squared residuals
SST=sum((y_hat-mean(y_hat)).^2);%SS_tot of the data, per the formula above
Rsq=1-SSR./SST
 

CycloWizard

Lifer
Sep 10, 2001
12,348
1
81
After reading up on R^2 and realizing that it's not really useful for this application, I've been thinking about this a little more. I think the correct approach might be to use a norm (e.g. the L2 norm) to compute the error. I can then compute the relative error in the norm to give a pretty solid comparison of the data to the model. I updated the thread title accordingly.
 

CycloWizard

Lifer
Sep 10, 2001
12,348
1
81
Originally posted by: GWestphal
Do you require a single number to describe the quality of the fit? Why not make a histogram of the variance of the Data-Model binned with some small window? That would show you where it worked and where it didn't in addition to how well it fit the model. Can you tell us more about the function in question, is it periodic or what?
I just need a way to quantify how well the model describes the data. It's visually apparent where the model is or isn't working, so regression plots/histograms aren't really necessary. Plus, since the model is based on first principles, it isn't trained on the data at all, nor is it fitted to the data using least squares or other techniques, so looking at the residual distribution won't tell me what I need to know. I'm not saying that's not a useful technique, only that it won't give me the numbers I need for this particular problem.

The "function" is generally a numerical solution to a constraint equation. They are definitely not periodic, and the ones for which I have been able to derive a closed-form solution are generally similar to y=1/(1+x)^2, though others are much more complicated (so complicated that I won't even try to type them out here :p).

I chose the L2 norm because it gives an objective measure of how far the data points lie from the model.
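Concretely, the relative error in the L2 norm is ||data - model||_2 / ||data||_2: it is 0 when the model reproduces the data exactly, and it is dimensionless, so it can be compared across the four data sets. A minimal NumPy sketch (the curve and noise are made up for illustration):

```python
import numpy as np

def relative_l2_error(data, model):
    """Relative error in the L2 norm: ||data - model|| / ||data||."""
    return np.linalg.norm(data - model) / np.linalg.norm(data)

# Illustrative only: the closed-form curve mentioned above plus small noise.
x = np.linspace(0.0, 1.0, 101)
model = 1.0 / (1.0 + x) ** 2
rng = np.random.default_rng(0)
data = model + 0.01 * rng.standard_normal(x.size)

err = relative_l2_error(data, model)  # 0 for a perfect model; grows with misfit
```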
 