Hypothesis testing
Drink whiskey, reject hypotheses
#️⃣  Mathematics ⌛  ~50 min 🗿  Beginner
05.09.2022
#11



🎓 9/167

This post is part of the Mathematics educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while the order in Research can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you could support my work and get cheap educational material. These courses will be of completely different quality, with more theoretical depth and niche focus, and will feature challenging projects, quizzes, exercises, video lectures and supplementary stuff. Stay tuned!


Hypothesis testing is a cornerstone of statistical inference and a critical tool in the data scientist's toolkit. In essence, it is a systematic procedure for deciding whether observed data provides sufficient evidence to reject or fail to reject an initially assumed premise (often called the null hypothesis). Whether you are evaluating the efficacy of a new drug, testing whether a new marketing strategy leads to better conversion rates, or comparing the performance of machine learning models, hypothesis testing offers a rigorous framework to guide your decisions.

Because data in real-world applications can be noisy or limited, it's insufficient to rely on raw intuition or heuristics. Hypothesis testing helps quantify the evidence and manage uncertainty. This practice is rooted in classical statistics but remains highly relevant in modern machine learning workflows — especially when dealing with large-scale experiments, such as online A/B testing or hyperparameter tuning.

Key concepts in hypothesis testing include the formulation of a null and an alternative hypothesis, understanding error types (Type I and Type II), selecting test statistics, and interpreting significance levels and p-values. In data science, these techniques are often integrated into larger pipelines (e.g., evaluating models or features) and can guide automated decision-making. Researchers have further extended these ideas to advanced settings, such as high-dimensional data or multiple hypothesis testing (Smith et al., NeurIPS 2022), underscoring hypothesis testing's continued importance and evolution.

Basic framework of hypothesis testing

Null and alternative hypotheses

A hypothesis test begins by specifying two statements: the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$ or $H_a$).

  • The null hypothesis $H_0$: This is usually the "status quo" or the baseline assumption. It often states that there is "no effect" or "no difference." For example, $H_0$ might be "the mean difference between two treatments is zero."
  • The alternative hypothesis $H_1$: This represents the new claim we suspect may be true. Continuing our example, $H_1$ could be "the mean difference between the two treatments is not zero."

The objective of the test is to determine whether the observed data is unlikely enough under $H_0$ to justify rejecting it in favor of $H_1$.

Type I and type II errors

In making a decision about the null hypothesis, two kinds of errors are possible:

  • Type I error ($\alpha$): This is the error of rejecting $H_0$ when it is actually true. It is also referred to as a "false positive."
  • Type II error ($\beta$): This is the error of failing to reject $H_0$ when it is actually false. It is also referred to as a "false negative."

The probability of committing a Type I error is typically denoted by $\alpha$, while the probability of committing a Type II error is $\beta$.
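
To make $\alpha$ concrete, here is a minimal simulation sketch (the normal data, group sizes, and two-sample t-test are purely illustrative choices): when $H_0$ is true, a test run at the 5% significance level should reject it in roughly 5% of repeated experiments.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 10_000

# Simulate experiments where H0 is true: both groups share the same mean.
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# The empirical false-positive rate should hover around alpha (~0.05).
print(f"Empirical Type I error rate: {false_positives / n_experiments:.3f}")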

[Figure: Overlapping distributions under the null and alternative hypotheses, highlighting the regions where Type I and Type II errors occur.]

Significance level and p-values

The significance level, denoted $\alpha$, is a threshold that defines the maximum tolerable probability of committing a Type I error. Common values for $\alpha$ are 0.05, 0.01, or 0.001, depending on how conservative one wishes to be.

A p-value is the probability of observing data at least as extreme as what we have, assuming $H_0$ is true. If this probability is below the chosen significance level $\alpha$, we reject $H_0$. If it is above $\alpha$, we fail to reject $H_0$.

Formally, the p-value is calculated as:

$$ p\text{-value} = P(\text{Test Statistic} \ge \text{observed value} \mid H_0 \text{ is true}) $$

Depending on whether the test is one-sided or two-sided, this probability is computed over one tail or both tails of the null distribution.
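
As a small illustration of how one-sided and two-sided p-values are computed from a null distribution, the sketch below assumes a standard normal (z) test statistic; the observed value 1.96 is arbitrary.

from scipy import stats

z = 1.96  # observed standardized test statistic (illustrative value)

# One-sided (upper-tail) p-value: P(Z >= z | H0)
p_one_sided = stats.norm.sf(z)

# Two-sided p-value: P(|Z| >= |z| | H0)
p_two_sided = 2 * stats.norm.sf(abs(z))

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")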

Confidence intervals and their relationship to hypothesis testing

A confidence interval is a range of values, constructed from the data, that is designed to contain the true parameter (e.g., a mean) at a chosen confidence level (usually 95% or 99%).

The connection between confidence intervals and hypothesis testing is best illustrated by a 95% confidence interval for the mean: if $\mu_0$ (the hypothesized mean under $H_0$) lies outside this interval, then a test of $H_0: \mu = \mu_0$ at the 5% significance level would lead you to reject $H_0$. Conversely, if $\mu_0$ falls inside the interval, you would fail to reject $H_0$ at that level.
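
The sketch below illustrates this correspondence with scipy.stats, using a small made-up sample: a 95% t-based confidence interval for the mean is computed alongside a one-sample t-test of $H_0: \mu = \mu_0$ at the 5% level.

import numpy as np
from scipy import stats

data = np.array([4.2, 3.9, 4.5, 5.1, 4.8])  # illustrative sample
mu_0 = 4.0                                   # hypothesized mean
n = len(data)

# 95% confidence interval for the mean based on the t-distribution
mean = data.mean()
sem = stats.sem(data)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

# Matching one-sample t-test at the 5% level
_, p_value = stats.ttest_1samp(data, mu_0)

print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"mu_0 inside CI: {ci_low <= mu_0 <= ci_high}, p-value: {p_value:.3f}")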

Test statistics

Definition of a test statistic

A test statistic is a numerical summary of the data that encapsulates the evidence against the null hypothesis. In classical hypothesis testing, the procedure typically follows:

  1. Compute the test statistic from the sample (e.g., the sample mean, or a standardized version of the sample mean).
  2. Determine the probability of observing this (or a more extreme) statistic under $H_0$.

Mathematically, a test statistic is often a function of the sample $\mathbf{X}$. We can denote it as $T(\mathbf{X})$. For instance, in a t-test, the test statistic is typically a standardized difference between the sample mean and the hypothesized mean.

Distribution of test statistics (z-distribution, t-distribution, etc.)

Under the null hypothesis, test statistics follow specific distributions:

  • Z-distribution: When the population variance is known or the sample size is large enough (invoking the Central Limit Theorem), the standardized mean often follows an approximately standard normal distribution.
  • T-distribution: When the population variance is unknown and the sample size is relatively small, the test statistic follows a t-distribution with degrees of freedom related to the sample size.
  • Chi-squared distribution: Commonly appears in variance tests or tests for categorical data (chi-squared tests).
  • F-distribution: Often used in analysis of variance (ANOVA) or comparing two variances.

The probability (or p-value) is computed based on how extreme the observed test statistic is within the corresponding distribution.
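
For intuition, the snippet below looks up critical values from several of these null distributions with scipy.stats; the degrees of freedom are arbitrary example values.

from scipy import stats

alpha = 0.05

# Two-sided critical value from the standard normal (z) distribution
z_crit = stats.norm.ppf(1 - alpha / 2)

# Two-sided critical value from a t-distribution with 10 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=10)

# Upper-tail critical value from a chi-squared distribution with 3 degrees of freedom
chi2_crit = stats.chi2.ppf(1 - alpha, df=3)

print(f"z: {z_crit:.3f}, t(10): {t_crit:.3f}, chi2(3): {chi2_crit:.3f}")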

[Figure: A conceptual overview of the Z, t, and chi-squared distributions and how they vary in shape.]

Choosing the right test statistic for your data

Selecting an appropriate test statistic requires careful consideration of:

  • Data type (continuous, categorical, etc.)
  • Sample size
  • Underlying assumptions (e.g., normality, equal variances)
  • The research question (e.g., difference of means, independence of variables, etc.)

In many data science applications, especially with large sample sizes, the Z-based approach may be preferred for computational efficiency. However, for smaller samples or when the population variance is unknown, t-tests or non-parametric tests are more suitable.

T-tests

One-sample t-test

A one-sample t-test is used to determine whether the mean of a single group differs significantly from a known or hypothesized mean $\mu_0$.

The test statistic is:

$$ T = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}, $$

where $\bar{X}$ is the sample mean, $\mu_0$ is the hypothesized mean, $s$ is the sample standard deviation, and $n$ is the sample size. This statistic follows a t-distribution with $n - 1$ degrees of freedom, assuming the sample is drawn from a normally distributed population (in practice, with large $n$, the t-distribution closely approximates the normal distribution).

If $|T|$ is large enough (i.e., the corresponding p-value is below $\alpha$), we reject $H_0$ and conclude that the sample mean differs from $\mu_0$.

Below is a short Python snippet illustrating how to perform a one-sample t-test using scipy.stats:


import numpy as np
from scipy import stats

# Sample data
data = np.array([4.2, 3.9, 4.5, 5.1, 4.8])
mu_0 = 4.0  # Hypothesized mean

t_stat, p_value = stats.ttest_1samp(data, mu_0)
print(f"t-statistic = {t_stat}, p-value = {p_value}")

Independent two-sample t-test

An independent two-sample t-test examines whether the means of two independent groups differ significantly. Suppose we have two samples $X_1, X_2, \ldots, X_{n_1}$ and $Y_1, Y_2, \ldots, Y_{n_2}$ from (possibly) two different populations.

The test statistic for the equal-variance (pooled) version is:

$$ T = \frac{\bar{X} - \bar{Y}}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

where $\bar{X}$ and $\bar{Y}$ are the sample means, and $s_p$ is the pooled standard deviation:

$$ s_p = \sqrt{\frac{(n_1 - 1) s_X^2 + (n_2 - 1) s_Y^2}{n_1 + n_2 - 2}} $$

For the unequal-variance (Welch) version, the formula adjusts to avoid assuming equal variances, and the degrees of freedom are approximated by the Welch–Satterthwaite equation.
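
Here is a short sketch of both versions using scipy.stats.ttest_ind (the simulated groups and their parameters are purely illustrative); the equal_var flag switches between the pooled and Welch formulations.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.5, scale=1.5, size=35)

# Pooled (equal-variance) two-sample t-test
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test (does not assume equal variances)
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.4f}")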

Paired t-test

A paired t-test is used when the data are paired or dependent, such as measurements before and after a treatment on the same subjects. You can transform the problem into a one-sample t-test by taking the difference between each pair of observations. The test statistic becomes:

$$ T = \frac{\bar{D} - 0}{s_D / \sqrt{n}}, $$

where $\bar{D}$ is the mean of the differences and $s_D$ is the standard deviation of the differences.
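
Below is a brief sketch using scipy.stats.ttest_rel on made-up before/after measurements, together with the equivalent one-sample computation on the differences.

import numpy as np
from scipy import stats

# Measurements on the same subjects before and after a treatment (illustrative)
before = np.array([12.1, 11.8, 13.0, 12.5, 11.9, 12.7])
after = np.array([12.6, 12.1, 13.4, 12.4, 12.5, 13.1])

# Paired t-test, equivalent to a one-sample t-test on the differences
t_stat, p_value = stats.ttest_rel(after, before)

diffs = after - before
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))

print(f"ttest_rel: t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"manual one-sample form: t = {t_manual:.3f}")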

Assumptions of t-tests and how to verify them

All t-tests make several assumptions:

  1. Normality: The data (or differences in paired designs) should be approximately normally distributed.
  2. Independence: Observations in each group should be independent of each other (unless it is a paired t-test, in which case each pair is dependent, but different pairs remain independent).
  3. Equal variance (for the standard two-sample t-test): When performing the pooled variance version, each group should come from a population with the same variance.

In practice, you can use visual methods (e.g., Q–Q plots) or formal tests (e.g., Shapiro–Wilk for normality, Levene's test for variance) to check these assumptions. When assumptions are significantly violated, alternative methods like the Welch t-test (for unequal variances) or non-parametric tests (e.g., Wilcoxon rank-sum) may be preferable.
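
As a rough sketch of these checks with scipy.stats (the simulated groups are illustrative): Shapiro–Wilk for normality, Levene's test for equal variances, and the Wilcoxon rank-sum test as a non-parametric fallback.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.0, scale=2.0, size=40)

# Shapiro-Wilk test for normality (H0: data come from a normal distribution)
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Levene's test for equality of variances (H0: variances are equal)
_, p_levene = stats.levene(group_a, group_b)

print(f"Shapiro p-values: {p_norm_a:.3f}, {p_norm_b:.3f}")
print(f"Levene p-value: {p_levene:.3f}")

# If assumptions look violated, fall back to Welch or a rank-based test
_, p_ranksum = stats.ranksums(group_a, group_b)  # Wilcoxon rank-sum
print(f"Wilcoxon rank-sum p-value: {p_ranksum:.3f}")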

Chi-squared test

Chi-squared goodness-of-fit test

The chi-squared goodness-of-fit test determines how well an observed frequency distribution matches an expected distribution. For example, suppose you have a categorical variable with $k$ categories, and you want to see if the distribution of counts across these categories follows a specified theoretical model.

The test statistic is:

$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, $$

where $O_i$ is the observed count in category $i$ and $E_i$ is the expected count under the null hypothesis. Under $H_0$, $\chi^2$ follows a chi-squared distribution with $k - 1$ degrees of freedom (adjusted further if parameters are estimated).
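
A minimal goodness-of-fit sketch using scipy.stats.chisquare, with made-up counts for a die hypothesized to be fair:

import numpy as np
from scipy import stats

# Observed counts of a die rolled 120 times (illustrative data)
observed = np.array([18, 22, 16, 25, 19, 20])

# Expected counts under H0: a fair die, 120 / 6 = 20 per face
expected = np.full(6, observed.sum() / 6)

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2_stat:.3f}, p-value = {p_value:.4f}")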

Chi-squared test of independence (contingency tables)

The chi-squared test of independence evaluates whether two categorical variables are associated. For instance, you might ask if customer gender is independent of whether they respond to a promotional email.

You arrange the counts of observations into a contingency table, compute the expected frequencies under independence, then calculate a chi-squared statistic similarly to the goodness-of-fit formula. If the test statistic is large (p-value $< \alpha$), you conclude that the two variables are not independent.
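
Here is a small sketch of the test of independence with scipy.stats.chi2_contingency; the 2x2 table of counts is made up for illustration.

import numpy as np
from scipy import stats

# Rows = gender, columns = responded / did not respond to the email (illustrative)
table = np.array([[30, 70],
                  [45, 55]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2_stat:.3f}, p-value = {p_value:.4f}, dof = {dof}")
print("expected counts under independence:\n", expected)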

When to use the chi-squared test

Chi-squared tests are suitable for:

  • Categorical data with multiple categories.
  • Testing conformity to an expected theoretical distribution.
  • Checking independence or association between categorical variables (in contingency tables).

They rely on the assumption that expected counts in each cell of the contingency table are sufficiently large (a common rule of thumb is at least 5). For smaller expected counts, alternatives like Fisher's exact test might be more reliable.

Additional considerations: multiple hypothesis testing and power

In data science and machine learning, you may run into scenarios involving many simultaneous hypothesis tests — for example, testing thousands of features for significance at once. Conducting multiple tests inflates the probability of finding false positives. Several advanced methods, such as the Bonferroni correction or the False Discovery Rate (FDR) procedure, address this issue (Benjamini and Hochberg, 1995).
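
A brief sketch of both corrections, assuming the statsmodels library is available; the batch of raw p-values is made up for illustration.

import numpy as np
from statsmodels.stats.multitest import multipletests

# Raw p-values from many simultaneous tests (illustrative values)
p_values = np.array([0.001, 0.008, 0.02, 0.04, 0.05, 0.12, 0.30, 0.45])

# Bonferroni: controls the family-wise error rate
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejections:", reject_bonf)
print("BH (FDR) rejections:  ", reject_bh)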

Another important concept is the power of a test, defined as $1 - \beta$, where $\beta$ is the probability of a Type II error. A test with higher power is better at detecting real effects. Power depends on:

  • Sample size
  • Effect size (i.e., how large the difference or effect is)
  • Significance level ($\alpha$)

Balancing $\alpha$, $\beta$, and sample size is a common challenge in experimental design and can guide practitioners to collect the right amount of data for reliable conclusions.
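
As a sketch of how these quantities trade off in practice, the snippet below uses statsmodels' TTestIndPower (the effect size, power, and $\alpha$ targets are illustrative assumptions) to solve for the required sample size and, conversely, the power achieved at a fixed sample size.

from statsmodels.stats.power import TTestIndPower

# Solve for the sample size per group needed to detect a medium effect
# (Cohen's d = 0.5) with 80% power at alpha = 0.05 (illustrative targets)
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"Required sample size per group: {n_per_group:.1f}")

# Conversely, compute the power achieved with 50 observations per group
power = analysis.power(effect_size=0.5, nobs1=50, alpha=0.05)
print(f"Power with n = 50 per group: {power:.3f}")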

Conclusion

Hypothesis testing provides a formal mechanism to assess whether a given effect or difference is real or can be attributed to chance. Whether you are comparing means with a t-test or assessing independence with a chi-squared test, understanding the conceptual framework — formulating hypotheses, selecting test statistics, and interpreting p-values and confidence intervals — is vital. These tests form the backbone of quantitative research in data science and machine learning, enabling practitioners to draw inferences from complex datasets with measured confidence.

Looking ahead, modern research and industrial applications often involve large-scale or high-dimensional data, leading to challenges in multiple hypothesis testing or ensuring that classical assumptions (normality, independence, etc.) hold. Researchers continue to develop new procedures (e.g., robust testing, permutation-based methods, advanced multiple correction techniques) to keep hypothesis testing both statistically rigorous and practically relevant in the face of big data. By mastering the fundamentals explored here, data scientists can navigate these complexities and make data-driven decisions with confidence.
