STATISTICS 101

For virtually any question you may want to investigate about the world, you have to focus on a particular group of individuals (for example, a group of people, a group of other animals, cities, rock specimens or exam scores). The group of individuals being studied in order to answer a research question is called a population.^[1,pp.51-52] If data is collected from the entire population, the process is called a census.^[1,p.54] Studies rarely have the time or money to gather data on every single individual in a population. Instead, researchers select a subset of individuals from the population, study those individuals and use the information gathered to draw conclusions about the whole population. This subset of the population is called a sample.^[1,p.52]

A parameter is a single number that describes a population, for example, the median household income in a country. A statistic is a single number that describes a sample, for example, the median household income in a sample of 1200 households.^[1,p.203] The parameters of a population are usually unknown, so data is collected on a sample from the population, the results from the sample are analysed and conclusions are made regarding the entire population based on the sample results.^[1,p.18]
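The distinction between a parameter and a statistic can be sketched with a small simulation. The population below is synthetic (made-up incomes drawn from a log-normal distribution, chosen only for illustration), and the names `population`, `population_median` and `sample_median` are hypothetical. In real studies the parameter is usually unknown; here it can be computed only because the whole population was generated.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 synthetic household incomes (made-up data).
population = [rng.lognormvariate(10.5, 0.6) for _ in range(100_000)]

# Parameter: a single number describing the whole population.
population_median = statistics.median(population)

# Statistic: a single number describing one random sample of 1200 households.
sample = rng.sample(population, 1200)
sample_median = statistics.median(sample)

print(f"parameter (population median): {population_median:,.0f}")
print(f"statistic (sample median):     {sample_median:,.0f}")
```

The two printed values should be close but not identical, which is why conclusions about the population based on a sample carry some uncertainty.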

Sample size

Sample size is the final number of individuals in the study, that is, the final size of the data set.^[1,p.2] It is commonly denoted by the letter "n". Sample size is an important factor in determining the accuracy and repeatability of the results.^[1,p.42] For a large population (thousands of individuals or more), it is the size of the sample, not the size of the population, that determines the accuracy of the results. For example, a random sample of 1000 individuals has an estimated accuracy of about ±3.2 percentage points, no matter whether the sample is taken from a small town of 10 000 individuals, a region of 1 million people or the entire country.^[1,p.268] Many surveys are based on large numbers of participants, but this isn't always true for other types of research, such as carefully controlled experiments. Because some types of research are costly in time and money, some studies are based on a small number of participants. Researchers have to find the appropriate balance when determining sample size.^[1,p.42]
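The 3.2 figure is consistent with a common rule of thumb for the accuracy of a proportion estimated from a simple random sample. The text does not say how the number was obtained, so the derivation below is an assumption: it takes a 95% confidence level (critical value 1.96) and the most conservative sample proportion of 0.5. Note that the population size does not appear anywhere in the expression.

```latex
\[
\text{accuracy} \approx 1.96\sqrt{\frac{0.5\,(1-0.5)}{n}}
= \frac{0.98}{\sqrt{n}} \approx \frac{1}{\sqrt{n}},
\qquad
\frac{1}{\sqrt{1000}} \approx 0.032 = 3.2\ \text{percentage points}.
\]
```

The approximation assumes that the sample is only a small fraction of the population.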

It is important to realize that sample results vary from sample to sample, so statistical results based on samples should include a measure of how much they are expected to vary.^[1,p.171] In other words, a statistic has its own distribution of possible values across samples. Statistical results are evaluated by measuring their accuracy, typically through the margin of error, which reflects how much the results are expected to vary from sample to sample.^[1,p.341]
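Sample-to-sample variation can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: the true proportion of 0.52, the sample size of 1000 and the number of repeated samples are illustrative choices, and each individual is simulated as an independent draw (in effect, a very large population).

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

TRUE_PROPORTION = 0.52  # assumed population proportion (illustrative)
SAMPLE_SIZE = 1000      # individuals per sample
NUM_SAMPLES = 2000      # how many independent samples to simulate


def sample_proportion(n: int) -> float:
    """Proportion of 'yes' answers in one simulated random sample of size n."""
    yes = sum(1 for _ in range(n) if rng.random() < TRUE_PROPORTION)
    return yes / n


# Each entry is the statistic obtained from a different random sample.
proportions = [sample_proportion(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_samples = statistics.mean(proportions)
spread = statistics.stdev(proportions)

# Share of samples that land within two standard deviations of the true value.
within = sum(
    abs(p - TRUE_PROPORTION) <= 2 * spread for p in proportions
) / NUM_SAMPLES

print(f"average sample proportion: {mean_of_samples:.4f}")
print(f"standard deviation across samples: {spread:.4f}")
print(f"share within two standard deviations: {within:.1%}")
```

The individual sample proportions scatter around the true value, and the size of that scatter is what a margin of error summarises.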

The two most important ideas regarding sample size and margin of error are the following:^[1,p.197]

Sample size and margin of error have an inverse relationship.
After a point, increasing the sample size gives diminishing returns in reducing the margin of error.

Having an inverse relationship means that as the sample size increases, the margin of error decreases; the relationship is called inverse because the two move in opposite directions. The more information is gathered, the more accurate the results will be, provided that the data is gathered and handled properly.^[1,p.198]

The relationship between the margin of error and sample size is non-linear, hence increasing the sample size has diminishing returns. For example, at the same confidence level of 95% and sample proportion p̂ = 0.52, increasing the sample size from 500 to 1000, 1500 and 2000 decreases the margin of error from 4.38% to 3.10%, 2.53% and 2.19% respectively. At some point, the extra cost of adding data points (recruiting more participants, surveying more people and so on) may not be worth the small improvement in the margin of error.^[1,p.198]
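The figures in this example can be reproduced with the standard formula for the margin of error of a sample proportion, z·√(p̂(1−p̂)/n). The text does not state which formula it uses, so this is an assumption; the critical value 1.96 for a 95% confidence level and the function name `margin_of_error` are likewise choices made for this sketch.

```python
import math

Z_95 = 1.96  # critical value commonly used for a 95% confidence level


def margin_of_error(p_hat: float, n: int, z: float = Z_95) -> float:
    """Margin of error for a sample proportion p_hat from a sample of size n."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Each extra 500 participants buys a smaller improvement than the last.
for n in (500, 1000, 1500, 2000):
    print(f"n = {n:>4}: margin of error = {margin_of_error(0.52, n):.2%}")
```

Because the sample size sits under a square root, halving the margin of error requires roughly four times as many participants.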

Bias & Randomisation

Bias is the systematic favouritism (intended or accidental) of certain types or groups of individuals or of certain responses.^[1,p.14] Bias can arise from the way a sample is selected or from the way data is collected.^[1,p.55] If the method for selecting a sample of data points from a population is biased, then the results will also be biased. Within the field of statistics it is not possible to measure bias; it is only possible to minimise it by designing good samples and studies.^[1,p.209]

Some of the most common sources of bias are:^[1,p.340]

Bad study design.
Selection of a sample that does not represent the population of interest.
Miscalibration of the measurement instrument.
Lack of objectivity on the part of the researchers.

To minimise bias in a study, the sample needs to be selected randomly. Sample selection is unbiased if every possible sample of a given size has an equal chance of being selected for the study.^[1,p.49] Taking a genuinely random sample requires a randomisation mechanism to select the individuals. One example is the use of a random number generator: the items in the sample are chosen using a computer-generated list of random numbers, so that each sample of items has the same chance of being selected.^[1,p.53]

Note that when designing an experiment, collecting a random sample of individuals to participate often isn't ethical, because experiments impose a treatment on the subjects.^[1,p.13] Such studies are conducted with subjects who volunteer to participate, that is, they self-select. Randomness can still be incorporated in a different way: by randomly assigning the subjects to the treatment group and the control group. If the groups are assigned at random, they have a good chance of being similar except for the treatment they receive. That way, if a large enough difference is found in the outcomes for the groups, it can be attributed to the treatment rather than to other factors.^[1,p.341]
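Both uses of randomness can be sketched with a pseudo-random number generator. The ID list and the volunteer names below are hypothetical, and the group sizes are arbitrary; the sketch only shows the mechanics of drawing a simple random sample and of randomly assigning volunteers to two groups.

```python
import random

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: ID numbers for a population of 10,000 individuals.
population_ids = list(range(1, 10_001))

# Simple random sample without replacement: every possible set of 100 IDs
# has the same chance of being chosen.
sample_ids = rng.sample(population_ids, k=100)

# Random assignment for an experiment: the 40 volunteers are hypothetical.
volunteers = [f"volunteer_{i:02d}" for i in range(1, 41)]
shuffled = volunteers[:]      # copy, so the original list is left untouched
rng.shuffle(shuffled)         # put the volunteers in random order

half = len(shuffled) // 2
treatment_group = shuffled[:half]   # first half receives the treatment
control_group = shuffled[half:]     # second half serves as the control

print(sorted(sample_ids)[:10])
print(treatment_group[:5], control_group[:5])
```

Shuffling and then splitting means the researcher has no say in who ends up in which group, which is what makes the two groups comparable on average.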

Both the quality and the quantity of information are important in assessing how accurate a statistic will be. The more good data goes into a statistic, the more accurate that statistic will be. Small sample sizes make results less accurate (unless the population is small to begin with). However, more data isn't always better data; it depends on how well the data was collected. A small random sample with well-collected data is much better than a large non-random sample with poorly collected data.^[1,p.342] No matter how large a sample is, if it is based on non-random methods, the result will not represent the population. A small random sample is more representative than a large non-random one.^[1,p.54] Sample size should always be reported along with the results of any study.
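The claim that a small random sample can beat a large non-random one can be illustrated with a simulation. Everything here is an assumption made for the sketch: the population values are synthetic, and the selection bias is modelled by making individuals with above-average values three times as likely to end up in the large sample (drawn with replacement for simplicity).

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 synthetic measurements (made-up data).
population = [rng.gauss(20, 5) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Small random sample: every individual has the same chance of selection.
small_random = rng.sample(population, 200)

# Large non-random sample: individuals with values above 20 are assumed to be
# three times as likely to be included (a simple model of selection bias).
weights = [3 if x > 20 else 1 for x in population]
large_biased = rng.choices(population, weights=weights, k=20_000)

random_error = abs(statistics.mean(small_random) - true_mean)
biased_error = abs(statistics.mean(large_biased) - true_mean)

print(f"true mean: {true_mean:.2f}")
print(f"error of small random sample (n=200): {random_error:.2f}")
print(f"error of large biased sample (n=20,000): {biased_error:.2f}")
```

Under these assumptions the biased sample is a hundred times larger, yet its estimate stays systematically off, because collecting more data from a biased process does not remove the bias.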
