Nimo

Analysing univariate distributions and sampling

Statistics

Flashcards

What are the five summary values shown in a boxplot?

Minimum, lower quartile (Q1), median (Q2), upper quartile (Q3) and maximum.

Key concepts

What you'll likely be quizzed about

Visual representations of univariate data

Histograms group continuous data into bins (classes) and display frequency by bar height when the bins have equal width; with unequal widths, bar area represents frequency, so bar height shows frequency density. Bar heights reveal shape (symmetry or skew), peaks (modes) and gaps. Boxplots summarise five statistics: minimum, lower quartile (Q1), median (Q2), upper quartile (Q3) and maximum; box length shows the interquartile range, whiskers show the spread of the rest of the data, and outliers, where identified, are plotted as separate points beyond the whiskers. Stem-and-leaf diagrams retain original values and reveal distributional shape for smaller data sets. Frequency tables and cumulative frequency tables provide counts and allow the median and quartiles to be estimated by interpolation when the data are grouped.
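The interpolation step for grouped data can be sketched in Python. This is a minimal illustration, assuming linear interpolation within the class that contains the target cumulative frequency; the function name `grouped_quantile` and the frequency table are hypothetical and not taken from the notes.

```python
def grouped_quantile(classes, p):
    """Estimate the p-th quantile (0 < p < 1) from a grouped frequency table.

    classes: ordered list of (lower_bound, upper_bound, frequency) tuples.
    Assumes values are spread evenly within each class (linear interpolation).
    """
    total = sum(freq for _, _, freq in classes)
    target = p * total          # cumulative frequency we need to reach
    cumulative = 0              # cumulative frequency below the current class
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= target:
            # Move into the class in proportion to how far the target lies in it.
            return lower + (target - cumulative) * (upper - lower) / freq
        cumulative += freq
    raise ValueError("p must lie strictly between 0 and 1")

# Hypothetical grouped data: (class lower bound, class upper bound, frequency).
table = [(0, 10, 4), (10, 20, 10), (20, 30, 16), (30, 40, 8), (40, 50, 2)]

q1 = grouped_quantile(table, 0.25)
median = grouped_quantile(table, 0.50)
q3 = grouped_quantile(table, 0.75)
print(q1, median, q3)
```

For the median, the target cumulative frequency is 20 out of 40. The first two classes account for 14, so the estimate lies 6/16 of the way through the 20 to 30 class, giving 23.75.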

Summary statistics: centre and spread

The mean gives the arithmetic average and responds strongly to extreme values; the median gives the middle value and resists outliers. The mode identifies the most common value or class. The range (maximum − minimum) measures total spread; the interquartile range (IQR = Q3 − Q1) measures the spread of the middle 50% and resists extreme values. Variance is the average squared deviation from the mean, and standard deviation is its square root, giving a typical distance from the mean in the original units; a larger standard deviation indicates greater dispersion. Choice of centre and spread depends on shape and outliers: use median and IQR for skewed distributions or when outliers exist; use mean and standard deviation for roughly symmetric distributions without extreme outliers.
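As a small illustration of these measures, the sketch below uses Python's standard `statistics` module on a hypothetical data set containing one extreme value. Quartile conventions differ between textbooks and software; the `"inclusive"` method used here is one choice, so other methods may give slightly different Q1 and Q3.

```python
import statistics

# Hypothetical data with one extreme value (40), for illustration only.
data = [2, 3, 3, 4, 5, 6, 7, 8, 40]

mean = statistics.mean(data)        # pulled upwards by the extreme value
median = statistics.median(data)    # middle value, resistant to the extreme value
mode = statistics.mode(data)        # most common value
data_range = max(data) - min(data)  # total spread

# Quartiles using the "inclusive" convention (an assumption; conventions vary).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                       # spread of the middle 50%

sample_sd = statistics.stdev(data)      # sample standard deviation (divisor n - 1)
sample_var = statistics.variance(data)  # sample variance (divisor n - 1)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={data_range} IQR={iqr} sd={sample_sd:.2f}")
```

Here the mean (about 8.67) sits well above the median (5) because of the single value 40, which is why the median and IQR are the safer summary for data like this.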

Shape, skewness and outliers

Symmetric distributions have mean and median close together and similar tails on both sides. Positive skew (right-skew) produces a long tail to the right and typically moves the mean above the median; negative skew (left-skew) produces a long tail to the left and typically moves the mean below the median. Outliers lie far from the bulk of data and affect the mean and standard deviation more than the median and IQR. Outliers may indicate data-entry errors, unusual but valid observations, or a mixture of populations. Investigation of outliers requires checking data sources and considering whether exclusion is justified; exclusion changes summary statistics and subsequent inferences.
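The sketch below shows how an outlier affects the mean more than the median. It flags outliers with the 1.5 × IQR fence rule, which is a common convention assumed here and not stated in the notes; the helper name `iqr_fences` and the data are hypothetical.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences; values outside them are flagged as outliers.

    The multiplier k=1.5 is a widely used convention, assumed for this sketch.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical right-skewed data with one extreme value.
data = [2, 3, 3, 4, 5, 6, 7, 8, 40]

low, high = iqr_fences(data)
outliers = [x for x in data if x < low or x > high]
retained = [x for x in data if low <= x <= high]

# Compare how much each measure of centre moves when the flagged value is excluded.
mean_shift = statistics.mean(data) - statistics.mean(retained)
median_shift = statistics.median(data) - statistics.median(retained)

print("flagged:", outliers)
print(f"mean shift={mean_shift:.2f}, median shift={median_shift:.2f}")
```

Flagging a value is only the first step. Whether to exclude it still depends on checking the data source, since the value may be a genuine observation.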

Comparing univariate distributions

Comparison uses visual tools and summary statistics together. Side-by-side boxplots reveal differences in medians, IQRs and outliers; histograms show differences in shape and modality. Numerical comparison uses differences in centres and relative spreads: compare medians for skewed data and means for symmetric data, and compare IQRs or standard deviations to assess variability. A larger sample size shows features in plots more clearly and reduces random fluctuation, so comparisons become more reliable; sampling bias or unequal sample selection can produce misleading apparent differences.
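A numerical comparison of two groups might look like the following sketch. The two groups and the helper name `summarise` are hypothetical, and the quartiles use the `"inclusive"` convention as an assumption.

```python
import statistics

def summarise(values):
    """Return centre and spread measures for one data set."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "median": q2,
        "iqr": q3 - q1,
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

# Hypothetical samples: group_a is roughly symmetric, group_b is right-skewed.
group_a = [12, 14, 15, 15, 16, 17, 18, 19, 21]
group_b = [10, 11, 13, 15, 18, 22, 25, 29, 37]

summary_a = summarise(group_a)
summary_b = summarise(group_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

In this example group B has a slightly higher median (18 against 16) but a much larger IQR (12 against 3), so the main difference between the groups is variability, not centre.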

Sampling methods and sampling error

Random sampling selects units so that each member of the population has a known, usually equal, chance of selection; random sampling reduces selection bias and allows probabilistic statements about uncertainty. Systematic, stratified and cluster sampling provide alternatives with trade-offs in practicality and precision. Non-random sampling (convenience, voluntary response) introduces bias and limits inference. Increasing the sample size reduces sampling error and narrows interval estimates; however, a poor sampling method causes bias that a larger sample does not fix. Sampling error reflects natural variation between different samples drawn by the same method.
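The sketch below illustrates simple random, systematic and proportional stratified sampling, and then checks by simulation that sample means vary less as the sample size grows. The population, the stratum labels and all function names are hypothetical; cluster sampling is not shown.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1000 units in two strata (70% "A", 30% "B").
population = [
    {"id": i, "stratum": "A" if i < 700 else "B",
     "value": rng.gauss(50 if i < 700 else 70, 10)}
    for i in range(1000)
]

def simple_random_sample(units, n):
    """Every unit has the same chance of selection."""
    return rng.sample(units, n)

def systematic_sample(units, n):
    """Random start, then every k-th unit in list order."""
    step = len(units) // n
    start = rng.randrange(step)
    return units[start::step][:n]

def stratified_sample(units, n):
    """Random sample within each stratum, sized in proportion to the stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(unit["stratum"], []).append(unit)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(units))  # rounding may shift the total slightly
        sample.extend(rng.sample(members, share))
    return sample

def spread_of_sample_means(units, n, repeats=200):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.mean([u["value"] for u in rng.sample(units, n)])
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

small_n_spread = spread_of_sample_means(population, 10)
large_n_spread = spread_of_sample_means(population, 250)
print(f"spread of means, n=10: {small_n_spread:.2f}; n=250: {large_n_spread:.2f}")
```

The simulation addresses sampling error only. A biased selection method, such as drawing only from stratum "A", would give means that stay off target however large the sample.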

Inference: estimation and limitations

Point estimates use sample statistics (sample mean, sample proportion) to estimate population parameters. Interval estimates provide a range that likely contains the population parameter; wider intervals reflect more uncertainty. Confidence intervals use sample statistics and assumed sampling distributions to quantify uncertainty, but rely on assumptions such as random sampling and approximate normality for means in moderate to large samples. Limitations arise from bias, small sample size, non-normal distributions for small samples, measurement error and unrepresentative sampling. Clear statement of assumptions and potential sources of error is essential when applying statistics to describe a population.
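A confidence interval for a mean can be sketched as follows, assuming a random sample and an approximately normal sampling distribution for the mean. The sketch uses a normal (z) critical value from `statistics.NormalDist`; for small samples a t critical value is the more usual choice. The function name `mean_confidence_interval` and the simulated sample are hypothetical.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for a population mean.

    Assumes random sampling and an approximately normal sampling
    distribution of the mean (reasonable for moderate to large samples).
    """
    n = len(sample)
    centre = statistics.mean(sample)                       # point estimate
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return centre - z * standard_error, centre + z * standard_error

# Hypothetical sample of 50 values, simulated for illustration only.
rng = random.Random(1)
sample = [rng.gauss(100, 15) for _ in range(50)]

low95, high95 = mean_confidence_interval(sample, 0.95)
low99, high99 = mean_confidence_interval(sample, 0.99)
print(f"95% interval: ({low95:.1f}, {high95:.1f})")
print(f"99% interval: ({low99:.1f}, {high99:.1f})")
```

The 99% interval is wider than the 95% interval because a higher confidence level requires a larger range. Neither interval corrects for bias in how the sample was drawn.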

Key notes

Important points to keep in mind

Use visual displays and summary statistics together for reliable interpretation.

Choose median and IQR for skewed data or when outliers exist.

Choose mean and standard deviation for roughly symmetric distributions without extreme values.

Larger sample size reduces random sampling error but does not remove bias.

Random sampling allows probabilistic statements about uncertainty; non-random sampling limits inference.

Report assumptions and possible sources of bias when inferring about a population.

Investigate outliers before deciding to retain or remove them; they can indicate real variation or error.

Confidence intervals require assumptions; violation of assumptions reduces their reliability.

Compare distributions with both plots (side-by-side boxplots or histograms) and summary measures.

State the sampling method and sample size when describing a population from sample data.
