Statistical Test

CycloWizard

Lifer
Sep 10, 2001
12,348
1
81
It's been quite a while since I took a statistics course and I'm having a hard time recalling the correct way to test whether two sets of data are statistically independent. The two sets of data are (x1,y1) and (x2,y2). In this case, "x" is simply a distance and "y" is a property that depends on distance. The 1 and 2 are the direction in which I'm testing, since there may be anisotropy in the property y. y1 and y2 are nonlinearly related to x1 and x2.

So, my question is: what statistical method do I need to determine whether y1 and y2 come from the same population or are statistically different?
 

PolymerTim

Senior member
Apr 29, 2002
383
0
0
Well, I didn't get so much out of my statistics class a few years ago, but can't you just collect multiple measurements of each and then compare the standard deviations of the two different data sets? Similar to what we do with single point measurements, except that you do it on each point in your data set.

So let's say you take 5 separate measurements of each dataset and each data set has 100 x,y points in it. Then you take the average and standard deviation of the 5 points at each x for data sets 1 and 2. Then you see if the average of each data set falls within a standard deviation of the other data set's average. If you wanted to quantify it as a single number, maybe you could do something like a mean squared error (like when fitting a line).
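A minimal sketch of the comparison described above, assuming each data set is stored as a mapping from x position to its repeated y measurements and that both sets were measured at the same x positions. The function names and the numbers are made up for illustration, and this is only a rough screen, not a formal significance test.

```python
from statistics import mean, stdev

def compare_by_position(set1, set2):
    """For each shared x, check whether each mean lies within one SD of the other."""
    results = {}
    for x in sorted(set(set1) & set(set2)):
        m1, s1 = mean(set1[x]), stdev(set1[x])
        m2, s2 = mean(set2[x]), stdev(set2[x])
        diff = abs(m1 - m2)
        results[x] = {"mean1": m1, "mean2": m2,
                      "overlap": diff <= s1 and diff <= s2}
    return results

def mean_squared_difference(results):
    """Single-number summary: mean squared difference of the per-position means."""
    return mean((r["mean1"] - r["mean2"]) ** 2 for r in results.values())

# Hypothetical data: 5 repeat measurements at each of two x positions.
set1 = {0.0: [10.0, 11.0, 9.0, 10.5, 9.5], 1.0: [20.0, 21.0, 19.0, 20.5, 19.5]}
set2 = {0.0: [10.2, 10.8, 9.4, 10.1, 9.9], 1.0: [30.0, 31.0, 29.0, 30.5, 29.5]}

results = compare_by_position(set1, set2)
for x, r in results.items():
    print(x, r["overlap"])
print(mean_squared_difference(results))
```

With these made-up numbers the two sets agree at the first position and disagree at the second, which is the kind of position-by-position picture this approach gives.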

I'm sure there's a better way to do it though if we can just get a statistician around here.

Out of curiosity, what kind of data are you working with? I once wrote a script to analyze the anisotropy of rings in 2D images (X-ray scattering patterns) that worked pretty well for me.

Good luck!

-Tim
 

CycloWizard

I'm measuring elastic modulus using indentation. It varies by about an order of magnitude with position within the biological tissue I'm working with. I cut the tissue two different ways and, therefore, indented in two different directions. Not necessarily the most rigorous approach, but it's a lot better than what anyone else has done in this area. :p

 

PolymerTim

That sounds pretty interesting. I remember reading a paper a while back about mechanical testing of bone, and they did something similar by cutting the bone in different directions for testing. I'll see if I can find it again and whether they did any special statistics, but I imagine averaging should work fine.
 

CycloWizard

I have multiple measurements at each position in each direction (9 measurements per spot, I think 20 spots per direction, 2 directions, for a total of 360 data points?).
 

madh83

Member
Jan 14, 2007
149
0
0
If you don't know the population variance, then you should use the two-sample t-test with the null hypothesis that the mean of the first sample = the mean of the second sample. Test at whatever confidence level is required; I think SAS or whatever statistics package you use does this automatically after you plug in the data. If they are significantly different, the null hypothesis should be rejected at the confidence level you picked.

This should help:

http://en.wikipedia.org/wiki/Student%27s_t-test
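A rough sketch of what a two-sample t-test computes, using only the Python standard library. It uses Welch's unequal-variance form, which is an assumption on my part since the post doesn't specify a variant. The function names are hypothetical, and the p-value uses a normal approximation rather than the t distribution, so it is only reasonable for fairly large samples. A statistics package would use the t distribution with the degrees of freedom computed here.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(sample1, sample2):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    q1 = variance(sample1) / n1   # squared standard error of mean 1
    q2 = variance(sample2) / n2   # squared standard error of mean 2
    t = (mean(sample1) - mean(sample2)) / sqrt(q1 + q2)
    df = (q1 + q2) ** 2 / (q1 ** 2 / (n1 - 1) + q2 ** 2 / (n2 - 1))
    return t, df

def two_sided_p_normal_approx(t):
    """Two-sided p-value from the standard normal (large-sample approximation)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

# Tiny made-up samples, just to show the call pattern. The normal
# approximation is NOT trustworthy at this sample size.
y1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y2 = [2.0, 3.0, 4.0, 5.0, 6.0]
t, df = welch_t(y1, y2)
print(t, df, two_sided_p_normal_approx(t))
```

A large p-value means the data give no reason to reject equal means; it does not prove the two samples come from the same population.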
 

PolymerTim


That looks interesting; thanks for pointing it out, madh83. I found that even MS Excel has this function with a few options built in. So it looks like CycloWizard would need a paired test since he has matched sets of data. One thing I'm not sure about is the number of tails (one-tailed or two-tailed) for the distribution. If I understand correctly, it looks like he would need a one-tailed test to give the probability that the two data sets were equal.

So in Excel, the function would look like TTEST(Array1,Array2,1,1)
 

madh83

Originally posted by: CycloWizard
It's been quite a while since I took a statistics course and I'm having a hard time recalling the correct way to test whether two sets of data are statistically independent. The two sets of data are (x1,y1) and (x2,y2). In this case, "x" is simply a distance and "y" is a property that depends on distance. The 1 and 2 are the direction in which I'm testing, since there may be anisotropy in the property y. y1 and y2 are nonlinearly related to x1 and x2.

So, my question is: what statistical method do I need to determine whether y1 and y2 come from the same population or are statistically different?

Okay, re-reading this, it actually looks like you're trying to see if a change in x has an effect on y. Is that what you're trying to figure out? If so, I might pose the question differently.



Tim, I just looked through Excel. If I were doing this to answer the OP's question, I'd probably first do an F-test on the Ys to see if the variances are significantly different. If they aren't, I'd use the two-sample t-test with equal variances; if they are, the version with unequal variances. Either way, it should give you the probability you're looking for. The tests I've done have always been two-sided with the t-distribution. You can get the answer either way, though, because two-sided just means that you are using the same t-value on both sides of the distribution to get your confidence interval. So, if you get a value like 2.74 for a 90% CI (alpha=.10), the t-values corresponding to that are (-2.74,2.74).

This was all done with the Analysis ToolPak. I just clicked on it and it gave me the option of picking the test. I'm sure you could find the Excel function for it too, but this was easier for me.


 

sjwaste

Diamond Member
Aug 2, 2000
8,757
12
81
OP said he needs to test if there is a non-linear relationship between the set of X and set of Y. A t-test is telling you whether a certain observation is statistically different from the mean of a sample, so I'm not sure if that's going to get at what you're looking for either. I think looking at the covariance like madh suggested is a good start, though. For linear relationships, I'd look at Pearson's correlation coefficient too.

I don't have a whole lot of answers for non-linear analysis, I'm afraid. I only got through undergrad advanced econometrics, and we stuck to linear modeling. Are we talking normally distributed data, at least? Am I getting hung up on the wrong detail?

EDIT: Can you transform either set to make the relationship linear? I think that you could use OLS or even Pearson's if the relationship can be made linear in the coefficients.
 

PolymerTim

One of us misunderstood the original question. My understanding is that there are two data sets (1 and 2) each with a set of independent x values with corresponding dependent y values. There is a non-linear relationship between x and y, but this is not what OP is asking. Rather the question is how to determine if the two data sets come from the same population or not. Please correct me if I'm wrong.
 

CycloWizard

You're right. Sorry I didn't get back about the other responses recently... The woman needed hardwood floors, so the research went on the back burner for a couple days. :p
 

CycloWizard

Well, I finally found something that isn't ideal, but it works. I'm just using multiple linear regression. The p-value in question comes out to be ~0.4, so even though the model is nonlinear, the nonlinearity isn't likely to induce a massive change in the p-value. If anyone can come up with a better method, I'd still like to hear it. Otherwise, I'm sure one of the reviewers of this paper will point it out to me later. :p Thanks for all the input guys.
 

KoolAidKid

Golden Member
Apr 29, 2002
1,932
0
76
Originally posted by: madh83
If you don't know the population variance, then you should use the two sample t-test with the hypotheses that the mean in the first sample = mean of second sample. Test at whatever confidence interval that's required, I think SAS or w/e statistics package you use does this automatically after plugging in the data. If they are significantly different, the hypotheses should be proven false at the CI you picked.

This should help:

http://en.wikipedia.org/wiki/Student%27s_t-test

Given my reading of your situation, I agree with the above post. This sounds like a situation for which an independent-samples t-test would be most appropriate. If I read your post correctly, you are trying to determine if the datasets y1 and y2 are sampled from the same population. If this is in fact what you are trying to do, then I would use the independent-samples t-test. Your statistical hypotheses for a 2-tailed test are as follows:

H0 (null hypothesis): the mean of the population that y1 was sampled from = the mean of the population that y2 was sampled from

H1 (alternate hypothesis): not H0

You can run this test in Excel by formatting your data in 2 columns, one for y1 and one for y2. In an independent-samples test the numbers on any given row are not related to one another. You can then use the Analysis ToolPak (t-test: 2 sample with equal variances) to test this pair of hypotheses.

Unless I am missing something I do not think that multiple linear regression is appropriate. That technique is used to determine the degree of linear relationship between a set of predictor variables and a dependent variable.

 

CycloWizard

Normally, that would be fine. However, since the corresponding independent variables x1 and x2 are dissimilar for the two data sets, I can't do this sort of apples-to-apples comparison. This is why I resorted to a regression method which, while less than ideal, is the only way I can find to compare two sets with different independent variable locations.
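One way to read "multiple linear regression" here, and this is an assumption since the exact model isn't given, is to pool both data sets and fit y = b0 + b1*x + b2*d, where d is an indicator that is 0 for direction 1 and 1 for direction 2. The coefficient b2 then estimates the offset between directions even though the x locations differ. The sketch below is plain-Python ordinary least squares; all names and numbers are hypothetical.

```python
from math import sqrt

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]          # zero here would mean a singular design
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_with_direction(x1, y1, x2, y2):
    """OLS fit of y = b0 + b1*x + b2*d; returns coefficients and standard errors."""
    rows = [(1.0, x, 0.0) for x in x1] + [(1.0, x, 1.0) for x in x2]
    y = list(y1) + list(y2)
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (len(y) - k)   # residual variance
    se = [sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return beta, se

# Made-up measurements at different x locations in the two directions.
x1, y1 = [0.0, 1.0, 2.0, 3.0], [2.1, 4.9, 8.2, 10.8]
x2, y2 = [0.5, 1.5, 2.5], [4.4, 7.6, 10.5]
beta, se = fit_with_direction(x1, y1, x2, y2)
print("direction offset:", beta[2], "t statistic:", beta[2] / se[2])
```

The t statistic for b2 would be compared against a t distribution with n - 3 degrees of freedom. If the slope might also differ between directions, an interaction term x*d could be added in the same way.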
 

PolymerTim

I can see how that poses some complexity. Do you have enough data points that you could do a simple interpolation without introducing significant error? Given the test method you described, I'm guessing the data sets aren't that large, so maybe this won't work. So what you're trying to do then is to perform nonlinear regression to get a good fit and then interpolate to compare the two data sets? Or do you just compare the two fit equations?
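A minimal sketch of the simple-interpolation idea, assuming the per-position mean values of one data set are linearly interpolated onto the x locations of the other so that differences can be taken point by point. The function names are hypothetical, and x values are assumed to be sorted and strictly increasing.

```python
from bisect import bisect_right

def interp_linear(x, xs, ys):
    """Linearly interpolate y at x; return None if x is outside the range of xs."""
    if x < xs[0] or x > xs[-1]:
        return None
    i = bisect_right(xs, x)
    if i == len(xs):               # x coincides with the last point
        return ys[-1]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def differences_at(x_ref, y_ref, x_other, y_other):
    """Return (x, y_other - interpolated y_ref) for each x_other inside x_ref's range."""
    out = []
    for x, y in zip(x_other, y_other):
        y_hat = interp_linear(x, x_ref, y_ref)
        if y_hat is not None:      # skip points that would need extrapolation
            out.append((x, y - y_hat))
    return out

# Made-up mean values at mismatched positions.
x_ref, y_ref = [0.0, 1.0, 2.0], [0.0, 10.0, 20.0]
x_other, y_other = [0.5, 1.5, 5.0], [6.0, 15.0, 99.0]
print(differences_at(x_ref, y_ref, x_other, y_other))
```

Points outside the reference range are dropped rather than extrapolated. With strongly curved data and sparse x locations, linear interpolation itself adds error, which is the concern raised above.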
 

CycloWizard

I just used a linear regression. Not the most rigorous way, but with a p of 0.4 and the slight nonlinearity of the system, I don't think it will make much difference. Of course, I'm sure the reviewers will tell me how stupid I am in a few weeks when I get this paper back, but I'm used to the abuse - I'm a fifth year grad student. :p
 

sjwaste


I would still suggest a log-log OLS model if you're dealing with some non-linear relationship just to see how close it comes to the one you already ran.

EDIT: Looks like you turned it in, you should be fine with OLS. I'm not sure that it's going to tell you whether or not they're from the same population, but if you backed up whatever correlation you found with something else to suggest it, it should be fine.
 

CycloWizard

Nope, didn't turn it in yet. The result of this test dictates a lot about how the results are interpreted and how I need to model them, so I couldn't move forward until I decided the result of the statistical test. The data are pretty well fitted by an exponential function, so I suppose I should try the log-transformed regression. Not sure why I didn't think of that... I think I'm trying too hard. :p
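A short sketch of the log-transformed regression, assuming the data follow y = a*exp(b*x) with all y values positive. Taking the log of y alone gives ln(y) = ln(a) + b*x, which is a straight line in x; a log-log transform would instead linearize a power law y = a*x^b. The function name and the numbers are made up for illustration.

```python
from math import exp, log

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on ln(y) versus x; returns (a, b)."""
    logs = [log(y) for y in ys]    # requires every y > 0
    n = len(xs)
    x_bar = sum(xs) / n
    l_bar = sum(logs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, logs))
    b = sxy / sxx                  # slope of the line in log space
    a = exp(l_bar - b * x_bar)     # intercept transformed back
    return a, b

# Hypothetical modulus-versus-distance data for the two directions.
x1, y1 = [0.0, 1.0, 2.0, 3.0], [2.0, 3.4, 5.3, 9.1]
x2, y2 = [0.5, 1.5, 2.5], [2.6, 4.1, 7.2]
print(fit_exponential(x1, y1))
print(fit_exponential(x2, y2))
```

Fitting each direction separately and comparing a and b is only a first look. Because the fit is done in log space, it minimizes relative rather than absolute error, which may or may not suit data spanning an order of magnitude.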
 

sjwaste


Hey, I've seen a ton of your posts go way over my head, so if I'm helping someone for once I'll take it :)

If it fits well to an exponential function, log-log should work and keep it linear in the coefficients (least squares regression is the best linear unbiased estimator if you stick to the Gauss-Markov assumptions). Out of curiosity, how many observations do you have in the dataset?

In terms of inferring that your sets of X and Y are from the same population, are you using this in conjunction with some other research?
 

CycloWizard

Originally posted by: sjwaste
If it fits well to an exponential function, log-log should work and keep it linear in the coefficients (least squares regression is the best linear unbiased estimate if you stick to the gauss-markov assumptions). Out of curiosity, how many observations do you have in the dataset?
Nine at each x value for each dataset. One direction has 12 x values and the other has about 20 (can't recall exactly off the top of my head and I don't have it on this computer :p).
Originally posted by: sjwaste
In terms of inferring that your sets of X and Y are from the same population, are you using this in conjunction with some other research?
The mechanical properties will be plugged into a mechanical model down the line. In the model, if I want an accurate result, I should input whether the material is isotropic or anisotropic. However, the boundary conditions for the mechanical model are very complicated (in fact, they must be found by comparing deformations iteratively with experiments), so I can't infer the mechanical properties from the mechanical model: they must be accurate at the input stage.