Chapter 5 Sampling Distributions

5.1 Review: Two branches of statistics

Descriptive statistics offer tools to describe variability. Inferential statistics are used to draw conclusions about populations (typically, large populations) from samples (typically, a relatively small number of observations). When we do inferential statistics, we often use null hypothesis significance testing (NHST) to see if our data (our observations) can falsify a null hypothesis. Inferential statistics and NHST let you use probability to make a logical statement about a population. Over the next few sections, we will explore the foundations of inferential statistics.

5.2 Sampling

Sampling is the process of selecting a group representative of the population.

Remember that the population is the entire group of interest. Examples: people, nursing home residents, consumers, etc. The population is the group we want to study. Populations may be any group you want to test a hypothesis about. It is up to the researcher to define the population. Populations can be small (e.g., “this study is about the 20 members of a class”), large (e.g., “this study is about adults in the United States”), and any size in between. Typically, research is most interesting and valuable when it applies to large groups of people.

The sample is a smaller set from the population. Sampling is the process of selecting individuals (usually at random) from a population. In mathematics, sample size is typically represented as n. It is the number of scores in the sample. The size of a population is typically represented as N. In APA-style manuscripts, however, sample size is represented as N.

5.3 Statistics are based on samples; parameters are based on populations

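Descriptive statistics applied to entire populations are called population parameters. Descriptive statistics applied to a sample are called sample statistics. For example, the mean of a whole population is a parameter, while the mean of a sample drawn from that population is a statistic.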

5.4 Sampling Distribution

If you take repeated samples, you can plot the mean of each sample. A collection of sample means forms a sampling distribution of the mean.

Important point: A sampling distribution is built from the statistics (here, the means) of many samples, not from the individual scores in a single sample.

A sampling distribution can be shown as a probability density function (PDF). Recall that a PDF is the smooth counterpart of a histogram: both show how frequently values occur.
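The idea of a sampling distribution can be sketched with a short simulation. This is a minimal illustration, not a required procedure: the population (10,000 scores with a mean near 100 and a standard deviation near 15), the sample size of 25, and the function name `sample_means` are all assumptions made up for the example.

```python
import random
import statistics


def sample_means(population, sample_size, num_samples, seed=0):
    """Draw num_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]


# Hypothetical population: 10,000 scores, mean about 100, sd about 15.
pop_rng = random.Random(42)
population = [pop_rng.gauss(100, 15) for _ in range(10_000)]

# Each entry in `means` is the mean of one random sample of 25 scores.
means = sample_means(population, sample_size=25, num_samples=2_000)

print("population mean:     ", round(statistics.fmean(population), 2))
print("mean of sample means:", round(statistics.fmean(means), 2))
print("sd of sample means:  ", round(statistics.stdev(means), 2))
```

The list `means` is the sampling distribution of the mean for this made-up population. Its values should cluster around the population mean and vary much less than the individual scores do.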

5.5 Standard Error

Sampling distributions have a mean and standard deviation, just like any other distribution we have seen. However, the standard deviation of a sampling distribution has a special name: the standard error.

Standard error is calculated using this formula: \(\sigma_{\bar{X}}=\frac{\sigma}{\sqrt{n}}\)

In words: divide the standard deviation of the population by the square root of the sample size.
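As a small worked example, the formula can be written as a function. The name `standard_error` and the numbers used (a population standard deviation of 15 with samples of 25 and 100) are assumptions chosen for illustration.

```python
import math


def standard_error(population_sd, sample_size):
    """Standard error of the mean: population sd divided by sqrt(n)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be a positive number")
    return population_sd / math.sqrt(sample_size)


print(standard_error(15, 25))   # 3.0
print(standard_error(15, 100))  # 1.5
```

Quadrupling the sample size from 25 to 100 halves the standard error, because the sample size enters the formula through its square root.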

5.6 The Central Limit Theorem

The central limit theorem is fundamental to inferential statistics. It’s also commonly misinterpreted. Here is an explanation of the central limit theorem (CLT).

The CLT says that: (1) assuming two things, (2) if you do a series of steps, then (3) you will obtain an outcome.

  • The two assumptions are a random sample and a variable that is continuous.
  • The steps are to take repeated random samples of the population and calculate the mean of each of those samples. Construct a sampling distribution from the sample means.
  • The outcome is that the histogram of the sample means is approximately normally distributed. We call this the sampling distribution of the mean. Under the CLT, it will be approximately normal even when the population itself is not, as long as we have a sufficiently large sample size.
  • This frequency distribution, like all frequency distributions, has a standard deviation. For the sampling distribution of the mean, that standard deviation is called the standard error of the mean.
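The steps above can be tried on a population that is clearly not normal. The sketch below assumes a made-up, strongly right-skewed population (exponential scores), samples of 50, and a hypothetical helper `skewness` that returns roughly 0 for a symmetric distribution. It only illustrates the theorem; it does not prove it.

```python
import random
import statistics


def skewness(values):
    """Third standardized moment: near 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(7)

# Hypothetical, strongly right-skewed population (exponential, mean about 1).
skewed_population = [rng.expovariate(1.0) for _ in range(20_000)]

# Repeated random samples of 50 scores, one mean per sample.
clt_means = [
    statistics.fmean(rng.sample(skewed_population, 50))
    for _ in range(3_000)
]

population_skew = skewness(skewed_population)
means_skew = skewness(clt_means)
print("skewness of population:  ", round(population_skew, 2))
print("skewness of sample means:", round(means_skew, 2))
```

The population should show a large positive skew, while the sample means should be much closer to symmetric, which is what an approximately normal sampling distribution looks like.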

The central limit theorem allows us to say that:

  • Our random sample will have a mean that approximates the population mean. We can use samples in place of having to measure every member of the population.
  • Each time we take a random sample and calculate the mean, we are most likely to get a value close to the population mean. Sample means cluster around the population mean.
  • It is possible to take a random sample and calculate the mean only to get a sample mean that is far away from the population mean, but this is unlikely to happen.
  • A larger sample size reduces the standard error of the mean.
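The last point can be checked by simulation. This sketch compares the standard deviation of simulated sample means with the value given by the standard error formula, \(\sigma/\sqrt{n}\), for three sample sizes. The population, the sample sizes (4, 16, 64), and the function name `empirical_standard_error` are assumptions for illustration.

```python
import math
import random
import statistics

rng = random.Random(11)

# Hypothetical population of scores (mean about 100, sd about 15).
scores = [rng.gauss(100, 15) for _ in range(10_000)]
sigma = statistics.pstdev(scores)


def empirical_standard_error(n, num_samples=2_000):
    """Standard deviation of the means of num_samples random samples of size n."""
    means = [statistics.fmean(rng.sample(scores, n)) for _ in range(num_samples)]
    return statistics.stdev(means)


# For each sample size, store (simulated SE, SE from the formula).
results = {}
for n in (4, 16, 64):
    results[n] = (empirical_standard_error(n), sigma / math.sqrt(n))
    print(f"n={n:>2}  simulated SE={results[n][0]:.2f}  formula SE={results[n][1]:.2f}")
```

The simulated and formula values should agree closely, and both should shrink as the sample size grows.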

5.7 Why are we doing this? Point estimates

The central limit theorem says that we can take a single sample and compute the mean to get a point estimate. A point estimate is a sample statistic that serves as an estimate of a population parameter. This makes inferential statistics useful and powerful. By taking a relatively small sample (maybe 20 people) and computing a statistic, we can use inferential statistics to make a conclusion about a very large population (maybe hundreds of thousands of people).

Because of the central limit theorem, the sampling distribution of the mean is centered on the population mean, so the mean of your sample serves as a point estimate for the population mean. For the same reason, the most likely mean for any random sample is a value close to the population mean.

5.8 Confidence Intervals

The confidence interval is a complement to the point estimate. The confidence interval gives an idea of how sample means would vary if many samples were drawn. If a large number of samples are drawn, and a 95% confidence interval is computed for each sample, then approximately 95% of these confidence intervals will contain the population mean. In that sense, you can be 95% confident that a 95% confidence interval includes the population parameter.

This interpretation is counter-intuitive (you may want to reread it), but think of confidence intervals as a range of values that include the population parameter a certain percentage of the time. Confidence intervals are about replication (many repeated samples).
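The replication idea can be made concrete with a simulation. The sketch below assumes a made-up population whose mean and standard deviation are known (in real research they are not), draws 2,000 samples of 30, builds an interval of the sample mean plus or minus 1.96 standard errors for each, and counts how often the interval contains the population mean. The function name `coverage` is hypothetical.

```python
import math
import random
import statistics

rng = random.Random(3)

# Hypothetical population; in practice the population mean and sd are unknown.
population = [rng.gauss(50, 10) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)


def coverage(sample_size, num_samples, z=1.96):
    """Fraction of intervals (sample mean +/- z * standard error) containing mu."""
    se = sigma / math.sqrt(sample_size)
    hits = 0
    for _ in range(num_samples):
        sample_mean = statistics.fmean(rng.sample(population, sample_size))
        if sample_mean - z * se <= mu <= sample_mean + z * se:
            hits += 1
    return hits / num_samples


coverage_rate = coverage(sample_size=30, num_samples=2_000)
print("share of intervals containing the population mean:", coverage_rate)
```

The printed share should land near 0.95. Any single interval either contains the population mean or it does not; the 95% describes how often the procedure succeeds across many repeated samples.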

The width of a confidence interval depends on sample size. With larger sample sizes, the standard error is smaller and the confidence interval is narrower. With smaller sample sizes, the standard error is larger and the confidence interval is wider.

Formula for the lower bound of the 95% confidence interval: \(\bar{X}-1.96\sigma_{\bar{X}}\)

In words: The sample mean minus 1.96 standard deviations of the sampling distribution (the standard error)

Formula for the upper bound of the 95% confidence interval: \(\bar{X}+1.96\sigma_{\bar{X}}\)

In words: The sample mean plus 1.96 standard deviations of the sampling distribution (the standard error)
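The two bounds can be computed together. In this minimal sketch, the function name `confidence_interval_95` and the numbers (a sample mean of 102 and a population standard deviation of 15) are made up for illustration, and the population standard deviation is assumed to be known.

```python
import math


def confidence_interval_95(sample_mean, population_sd, sample_size):
    """Return (lower, upper): the sample mean minus/plus 1.96 standard errors."""
    standard_error = population_sd / math.sqrt(sample_size)
    margin = 1.96 * standard_error
    return sample_mean - margin, sample_mean + margin


# Made-up numbers for illustration: sample mean 102, population sd 15.
lower, upper = confidence_interval_95(102, 15, 25)
print(f"n=25:  {lower:.2f} to {upper:.2f}")          # n=25:  96.12 to 107.88

lower_big, upper_big = confidence_interval_95(102, 15, 100)
print(f"n=100: {lower_big:.2f} to {upper_big:.2f}")  # n=100: 99.06 to 104.94
```

With the same sample mean, the interval for a sample of 100 is half as wide as the interval for a sample of 25, because the standard error is halved.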