Simple statistical analysis question

MarkLuvsCS · Apr 15, 2010

I wasn't sure where else to ask this although I agree this is probably an odd place for it.

I haven't used any statistics for about 10 years now. I want to perform a simple analysis for a project, but I'm really unsure where to start. I have about 100,000 data points and would like to use a sample of the data to extrapolate to the entire group. I think I remember that, in order to make a good assessment, I first need to perform a test to see how large a sample I need. The data is in an alphabetically sorted list. I was wondering whether it would be statistically sound to take every Zth entry from the list as my sample, where Z depends on how many entries I need:

Z = X / N, where X is the total number of data points and N is the number required by the sample size calculation

I will try to give a general idea on types of data.

Code:

          a        b         c             d               e
1     North        A      English       Active           Boat A
2     South        B      Science       Active           Boat B
3     South        C      Math          Inactive         Boat C
4     North        A      Math          Active           Boat A
5     North        D      English       Inactive         Boat C

This is not the exact same data being used but it is representative of what it sorta looks like in the database. I just want to make some simple assertions like 30% of the people are in Math, 22% are rank A, or 75% of the people are active.

If someone could please point me in the right direction on some basic theories/tests, I would be very appreciative. I looked on Wikipedia for this information, but the tests I found seemed considerably more complex than what I think is a very simple problem needs. I would consider myself fairly decent in mathematics. Just for background, I have taken courses from stats/trig through Calc II, Calc III, and Differential Equations, so I won't be totally clueless if you point me somewhere.

I'm sorry for the long rant, but thanks for your time!

Mark R · Apr 15, 2010

Whenever you measure a 'sample' - i.e. something less than the whole population - you potentially mis-measure. E.g. just due to chance, you might randomly select only people who do Maths, and get a false result.

To get around this, whenever you measure a sample, you must also calculate the 'confidence interval'. This gives an idea of how accurate your measurement is likely to be - i.e. your confidence in your result.

Let's say that x% of the students in the country took English in any one year.

Let's do an experiment. You ask 100 students, whether they took English that year. 30 say yes. What is x (i.e. the proportion of students who took English)?

The best estimate of x is 30%, but with a 95% confidence interval of 21.2% - 40.0%

This means that you can be 95% sure that the actual value of x is between 21.2 and 40.0%. There is a 5% chance that the actual value is outside that range.

The larger your sample, the narrower your confidence interval will be. If you asked 1000 students and 300 said yes, the answer would be 30% with a 95% CI of 27.1 - 32.9%
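To put a rough number on that relationship, here is a minimal Python sketch of the usual normal-approximation sample size formula for a proportion, with an optional finite population correction. The function name `required_sample_size`, the default worst-case proportion p = 0.5, and z = 1.96 (for roughly 95% confidence) are assumptions made for this example, not something stated in the thread.

```python
import math

def required_sample_size(margin, population=None, p=0.5, z=1.96):
    """Approximate sample size needed to estimate a proportion.

    margin     -- desired half-width of the interval, e.g. 0.03 for +/- 3 points
    population -- total population size; if given, apply the finite
                  population correction
    p          -- assumed proportion (0.5 is the most conservative choice)
    z          -- normal quantile (1.96 corresponds to about 95% confidence)
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction shrinks the requirement slightly.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

unlimited = required_sample_size(0.03)
from_100k = required_sample_size(0.03, population=100_000)
```

Under these assumptions, a margin of plus or minus 3 percentage points needs a little over a thousand entries, and the correction for a population of 100,000 lowers that only marginally.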

The calculation of the confidence intervals can be complex, and there are numerous methods with various advantages and disadvantages.

Reading up on 'confidence intervals' in a suitable statistics reference would be a good start.
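As one illustration of those methods, here is a minimal Python sketch of the Wilson score interval, a common choice for proportions. The function name `wilson_interval` and the fixed z = 1.96 (for roughly 95% confidence) are choices made for this example. It is not necessarily the method behind the figures quoted above, so its bounds differ slightly from them.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion.

    successes -- number of 'yes' answers in the sample
    n         -- sample size
    z         -- normal quantile (1.96 corresponds to about 95% confidence)
    Returns (lower, upper) as fractions between 0 and 1.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

low_100, high_100 = wilson_interval(30, 100)
low_1000, high_1000 = wilson_interval(300, 1000)
```

For 30 out of 100 this gives roughly 21.9% to 39.6%, and for 300 out of 1000 roughly 27.2% to 32.9%, which shows the same narrowing with sample size.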

MarkLuvsCS · Apr 15, 2010

Ahh yes thanks Mark! I do recall a bit of that! Thanks a bunch!

edcarman · Apr 16, 2010

I'm a bit rusty on this, but I think your sampling method of taking every Zth item from a sorted list may be biased.

An unbiased sample is one in which each member of the population has an equal chance of being selected. Selecting items at a fixed interval in an ordered group negates this. E.g. if you select every 5th item, starting with No. 5, and you only have four As, then your sampling method makes it impossible for an A to ever be selected.

I think you could get around this by either selecting every Zth item from a randomly ordered list, or else generating random numbers to tell you which ones to pick. e.g. if you want 500 samples from your list, generate 500 random numbers between 1 and 100,000 and then pick those entries.
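A minimal Python sketch of the random-number approach, assuming the list is addressed by zero-based row index. The function name `draw_sample_indices` and the `seed` parameter are made up for this example. It uses `random.sample`, which draws without replacement, so no entry is picked twice (generating 500 independent random numbers could produce duplicates).

```python
import random

def draw_sample_indices(population_size, sample_size, seed=None):
    """Pick `sample_size` distinct row indices uniformly at random.

    Indices are zero-based: 0 .. population_size - 1.
    Passing a seed makes the selection reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(range(population_size), sample_size)

# 500 distinct entries out of 100,000, as in the example above.
indices = draw_sample_indices(100_000, 500, seed=42)
```

Sorting `indices` afterwards makes it easier to walk through the list in order while pulling the chosen entries.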

Finally, as an academic exercise, 100,000 items is a small enough population that you could probably calculate statistics for the entire population to compare with your sample stats.

daishi5 · Apr 16, 2010

The best way to get an unbiased sample is to use a random selection process. If your data was entered randomly then selecting every Zth item would also count, but from what you have mentioned there is no reason to assume the data was entered randomly. My own personal experience is that people tend to organize the work they are doing for data entry, which removes the randomness.

Excel is probably the best tool for something like this. Using COUNTIF formulas, you could easily calculate the actual percentages for the entire population rather than estimating them from a sample. If you want to use a sample, Excel is still your friend. One thing to pay attention to is that your examples were all about binary data, where a person can only be in one of two states, such as in Math or not in Math. I haven't worked with that in a while, but http://en.wikipedia.org/wiki/Binomial_distribution seems like a good place to start.

Normally with something like this, you would set your variables such as "in math" and "not in math" to 1 and 0 respectively. Don't set them to values such as "in english = 2", "in math = 1" and "in neither = 0", because those numbers would imply an ordering and spacing between categories that doesn't exist.

homercles337 · Apr 28, 2010

Why bother with resampling at all? If the 100k is your population just calculate exact percentages. 100k is not a lot.
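A minimal Python sketch of that exact calculation, assuming each record is available as a dict keyed by column name. The function name `category_percentages` is made up for this example, and the five rows are just the representative ones from the first post, not real data.

```python
from collections import Counter

def category_percentages(rows, column):
    """Exact percentage of rows taking each value in `column`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: 100.0 * count / total for value, count in counts.items()}

# The five representative rows from the original question.
rows = [
    {"a": "North", "b": "A", "c": "English", "d": "Active", "e": "Boat A"},
    {"a": "South", "b": "B", "c": "Science", "d": "Active", "e": "Boat B"},
    {"a": "South", "b": "C", "c": "Math", "d": "Inactive", "e": "Boat C"},
    {"a": "North", "b": "A", "c": "Math", "d": "Active", "e": "Boat A"},
    {"a": "North", "b": "D", "c": "English", "d": "Inactive", "e": "Boat C"},
]

subject_pct = category_percentages(rows, "c")
status_pct = category_percentages(rows, "d")
```

One pass per column over 100,000 rows is quick, so the same function works unchanged on the full list.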
