Hypothesis testing
Drink whiskey, reject hypotheses
#️⃣  Mathematics ⌛  ~50 min 🗿  Beginner
05.09.2022
#11



🎓 9/167

This post is part of the Mathematics educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while the order in Research can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you could support my work and get cheap educational material. These courses will be of completely different quality, with more theoretical depth and niche focus, and will feature challenging projects, quizzes, exercises, video lectures and supplementary stuff. Stay tuned!


Hypothesis testing is a cornerstone of statistical inference and a critical tool in the data scientist's toolkit. In essence, it is a systematic procedure for deciding whether observed data provides sufficient evidence to reject or fail to reject an initially assumed premise (often called the null hypothesis). Whether you are evaluating the efficacy of a new drug, testing whether a new marketing strategy leads to better conversion rates, or comparing the performance of machine learning models, hypothesis testing offers a rigorous framework to guide your decisions.

Because data in real-world applications can be noisy or limited, it's insufficient to rely on raw intuition or heuristics. Hypothesis testing helps quantify the evidence and manage uncertainty. This practice is rooted in classical statistics but remains highly relevant in modern machine learning workflows — especially when dealing with large-scale experiments, such as online A/B testing or hyperparameter tuning.

Key concepts in hypothesis testing include the formulation of a null and an alternative hypothesis, understanding error types (Type I and Type II), selecting test statistics, and interpreting significance levels and p-values. In data science, these techniques are often integrated into larger pipelines (e.g., evaluating models or features) and can guide automated decision-making. Researchers have further extended these ideas to advanced settings, such as high-dimensional data or multiple hypothesis testing (Smith et al., NeurIPS 2022), underscoring hypothesis testing's continued importance and evolution.

Basic framework of hypothesis testing

Null and alternative hypotheses

A hypothesis test begins by specifying two statements: the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$ or $H_a$).

  • The null hypothesis $H_0$: This is usually the "status quo" or the baseline assumption. It often states that there is "no effect" or "no difference." For example, $H_0$ might be "the mean difference between two treatments is zero."
  • The alternative hypothesis $H_1$: This represents the new claim we suspect may be true. Continuing our example, $H_1$ could be "the mean difference between the two treatments is not zero."

The objective of the test is to determine whether the observed data is unlikely enough under $H_0$ to justify rejecting it in favor of $H_1$.

Type I and type II errors

In making a decision about the null hypothesis, two kinds of errors are possible:

  • Type I error ($\alpha$): This is the error of rejecting $H_0$ when it is actually true. It is also referred to as a "false positive."
  • Type II error ($\beta$): This is the error of failing to reject $H_0$ when it is actually false. It is also referred to as a "false negative."

The probability of committing a Type I error is typically denoted by $\alpha$, while the probability of committing a Type II error is $\beta$.
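
To make $\alpha$ concrete, here is a minimal simulation sketch (the normal data, group sizes, and two-sample t-test are purely illustrative choices): when $H_0$ is true, a test run at the 5% significance level should reject it in roughly 5% of repeated experiments.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 10_000

# Simulate experiments where H0 is true: both groups share the same mean.
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# The empirical false-positive rate should hover around alpha (~0.05).
print(f"Empirical Type I error rate: {false_positives / n_experiments:.3f}")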

[Figure: Overlapping distributions under the null and alternative hypotheses, highlighting the regions where Type I and Type II errors occur.]

Significance level and p-values

The significance level, denoted $\alpha$, is a threshold that defines the maximum tolerable probability of committing a Type I error. Common values for $\alpha$ are 0.05, 0.01, or 0.001, depending on how conservative one wishes to be.

A p-value is the probability of observing data at least as extreme as what we have, assuming $H_0$ is true. If this probability is below the chosen significance level $\alpha$, we reject $H_0$. If it is above $\alpha$, we fail to reject $H_0$.

Formally, the p-value is calculated as:

$$ p\text{-value} = P(\text{Test Statistic} \ge \text{observed value} \mid H_0 \text{ is true}) $$

Depending on whether the test is one-sided or two-sided, this probability is computed over one tail or both tails of the null distribution.
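
As a small illustration of how one-sided and two-sided p-values are computed from a null distribution, the sketch below assumes a standard normal (z) test statistic; the observed value 1.96 is arbitrary.

from scipy import stats

z = 1.96  # observed standardized test statistic (illustrative value)

# One-sided (upper-tail) p-value: P(Z >= z | H0)
p_one_sided = stats.norm.sf(z)

# Two-sided p-value: P(|Z| >= |z| | H0)
p_two_sided = 2 * stats.norm.sf(abs(z))

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")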

Confidence intervals and their relationship to hypothesis testing

A confidence interval is a range of values, constructed from the data, that is designed to contain the true parameter (e.g., a mean) at a chosen confidence level (usually 95% or 99%).

The connection between confidence intervals and hypothesis testing is best illustrated by a 95% confidence interval for the mean: if $\mu_0$ (the hypothesized mean under $H_0$) lies outside this interval, then a test of $H_0: \mu = \mu_0$ at the 5% significance level would lead you to reject $H_0$. Conversely, if $\mu_0$ falls inside the interval, you would fail to reject $H_0$ at that level.
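
The sketch below illustrates this correspondence with scipy.stats, using a small made-up sample: a 95% t-based confidence interval for the mean is computed alongside a one-sample t-test of $H_0: \mu = \mu_0$ at the 5% level.

import numpy as np
from scipy import stats

data = np.array([4.2, 3.9, 4.5, 5.1, 4.8])  # illustrative sample
mu_0 = 4.0                                   # hypothesized mean
n = len(data)

# 95% confidence interval for the mean based on the t-distribution
mean = data.mean()
sem = stats.sem(data)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

# Matching one-sample t-test at the 5% level
_, p_value = stats.ttest_1samp(data, mu_0)

print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"mu_0 inside CI: {ci_low <= mu_0 <= ci_high}, p-value: {p_value:.3f}")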

Test statistics

Definition of a test statistic

A test statistic is a numerical summary of the data that encapsulates the evidence against the null hypothesis. In classical hypothesis testing, the procedure typically follows:

  1. Compute the test statistic from the sample (e.g., the sample mean, or a standardized version of the sample mean).
  2. Determine the probability of observing this (or a more extreme) statistic under $H_0$.

Mathematically, a test statistic is often a function of the sample $\mathbf{X}$. We can denote it as $T(\mathbf{X})$. For instance, in a t-test, the test statistic is typically a standardized difference between the sample mean and the hypothesized mean.

Distribution of test statistics (z-distribution, t-distribution, etc.)

Under the null hypothesis, test statistics follow specific distributions:

  • Z-distribution: When the population variance is known or the sample size is large enough (invoking the Central Limit Theorem), the standardized mean often follows an approximately standard normal distribution.
  • T-distribution: When the population variance is unknown and the sample size is relatively small, the test statistic follows a t-distribution with degrees of freedom related to the sample size.
  • Chi-squared distribution: Commonly appears in variance tests or tests for categorical data (chi-squared tests).
  • F-distribution: Often used in analysis of variance (ANOVA) or comparing two variances.

The probability (or p-value) is computed based on how extreme the observed test statistic is within the corresponding distribution.
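
For intuition, the snippet below looks up critical values from several of these null distributions with scipy.stats; the degrees of freedom are arbitrary example values.

from scipy import stats

alpha = 0.05

# Two-sided critical value from the standard normal (z) distribution
z_crit = stats.norm.ppf(1 - alpha / 2)

# Two-sided critical value from a t-distribution with 10 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=10)

# Upper-tail critical value from a chi-squared distribution with 3 degrees of freedom
chi2_crit = stats.chi2.ppf(1 - alpha, df=3)

print(f"z: {z_crit:.3f}, t(10): {t_crit:.3f}, chi2(3): {chi2_crit:.3f}")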

[Figure: A conceptual overview of the Z, t, and chi-squared distributions and how they vary in shape.]

Choosing the right test statistic for your data

Selecting an appropriate test statistic requires careful consideration of:

  • Data type (continuous, categorical, etc.)
  • Sample size
  • Underlying assumptions (e.g., normality, equal variances)
  • The research question (e.g., difference of means, independence of variables, etc.)

In many data science applications, especially with large sample sizes, the Z-based approach may be preferred for computational efficiency. However, for smaller samples or when the population variance is unknown, t-tests or non-parametric tests are more suitable.

T-tests

One-sample t-test

A one-sample t-test is used to determine whether the mean of a single group differs significantly from a known or hypothesized mean $\mu_0$.

The test statistic is:

$$ T = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}, $$

where $\bar{X}$ is the sample mean, $\mu_0$ is the hypothesized mean, $s$ is the sample standard deviation, and $n$ is the sample size. This statistic follows a t-distribution with $n - 1$ degrees of freedom, assuming the sample is drawn from a normally distributed population (in practice, with large $n$, the t-distribution closely approximates the normal distribution).

If $|T|$ is large enough (i.e., the corresponding p-value is below $\alpha$), we reject $H_0$ and conclude that the sample mean differs from $\mu_0$.

Below is a short Python snippet illustrating how to perform a one-sample t-test using scipy.stats:


import numpy as np
from scipy import stats

# Sample data
data = np.array([4.2, 3.9, 4.5, 5.1, 4.8])
mu_0 = 4.0  # Hypothesized mean

t_stat, p_value = stats.ttest_1samp(data, mu_0)
print(f"t-statistic = {t_stat}, p-value = {p_value}")

Independent two-sample t-test

An independent two-sample t-test examines whether the means of two independent groups differ significantly. Suppose we have two samples $X_1, X_2, \ldots, X_{n_1}$ and $Y_1, Y_2, \ldots, Y_{n_2}$ from (possibly) two different populations.

The test statistic for the equal-variance (pooled) version is:

$$ T = \frac{\bar{X} - \bar{Y}}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

where $\bar{X}$ and $\bar{Y}$ are the sample means, and $s_p$ is the pooled standard deviation:

$$ s_p = \sqrt{\frac{(n_1 - 1) s_X^2 + (n_2 - 1) s_Y^2}{n_1 + n_2 - 2}} $$

For the unequal-variance (Welch) version, the formula adjusts to avoid assuming equal variances, and the degrees of freedom are approximated by the Welch–Satterthwaite equation.
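
Here is a short sketch of both versions using scipy.stats.ttest_ind (the simulated groups and their parameters are purely illustrative); the equal_var flag switches between the pooled and Welch formulations.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.5, scale=1.5, size=35)

# Pooled (equal-variance) two-sample t-test
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test (does not assume equal variances)
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.4f}")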

Paired t-test

A paired t-test is used when the data are paired or dependent, such as measurements before and after a treatment on the same subjects. You can transform the problem into a one-sample t-test by taking the difference between each pair of observations. The test statistic becomes:

$$ T = \frac{\bar{D} - 0}{s_D / \sqrt{n}}, $$

where $\bar{D}$ is the mean of the differences and $s_D$ is the standard deviation of the differences.
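
Below is a brief sketch using scipy.stats.ttest_rel on made-up before/after measurements, together with the equivalent one-sample computation on the differences.

import numpy as np
from scipy import stats

# Measurements on the same subjects before and after a treatment (illustrative)
before = np.array([12.1, 11.8, 13.0, 12.5, 11.9, 12.7])
after = np.array([12.6, 12.1, 13.4, 12.4, 12.5, 13.1])

# Paired t-test, equivalent to a one-sample t-test on the differences
t_stat, p_value = stats.ttest_rel(after, before)

diffs = after - before
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))

print(f"ttest_rel: t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"manual one-sample form: t = {t_manual:.3f}")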

Assumptions of t-tests and how to verify them

All t-tests make several assumptions:

  1. Normality: The data (or differences in paired designs) should be approximately normally distributed.
  2. Independence: Observations in each group should be independent of each other (unless it is a paired t-test, in which case each pair is dependent, but different pairs remain independent).
  3. Equal variance (for the standard two-sample t-test): When performing the pooled variance version, each group should come from a population with the same variance.

In practice, you can use visual methods (e.g., Q–Q plots) or formal tests (e.g., Shapiro–Wilk for normality, Levene's test for variance) to check these assumptions. When assumptions are significantly violated, alternative methods like the Welch t-test (for unequal variances) or non-parametric tests (e.g., Wilcoxon rank-sum) may be preferable.
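
As a rough sketch of these checks with scipy.stats (the simulated groups are illustrative): Shapiro–Wilk for normality, Levene's test for equal variances, and the Wilcoxon rank-sum test as a non-parametric fallback.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.0, scale=2.0, size=40)

# Shapiro-Wilk test for normality (H0: data come from a normal distribution)
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Levene's test for equality of variances (H0: variances are equal)
_, p_levene = stats.levene(group_a, group_b)

print(f"Shapiro p-values: {p_norm_a:.3f}, {p_norm_b:.3f}")
print(f"Levene p-value: {p_levene:.3f}")

# If assumptions look violated, fall back to Welch or a rank-based test
_, p_ranksum = stats.ranksums(group_a, group_b)  # Wilcoxon rank-sum
print(f"Wilcoxon rank-sum p-value: {p_ranksum:.3f}")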

Chi-squared test

Chi-squared goodness-of-fit test

The chi-squared goodness-of-fit test determines how well an observed frequency distribution matches an expected distribution. For example, suppose you have a categorical variable with $k$ categories, and you want to see if the distribution of counts across these categories follows a specified theoretical model.

The test statistic is:

$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, $$

where $O_i$ is the observed count in category $i$ and $E_i$ is the expected count under the null hypothesis. Under $H_0$, $\chi^2$ follows a chi-squared distribution with $k - 1$ degrees of freedom (adjusted further if parameters are estimated).
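
A minimal goodness-of-fit sketch using scipy.stats.chisquare, with made-up counts for a die hypothesized to be fair:

import numpy as np
from scipy import stats

# Observed counts of a die rolled 120 times (illustrative data)
observed = np.array([18, 22, 16, 25, 19, 20])

# Expected counts under H0: a fair die, 120 / 6 = 20 per face
expected = np.full(6, observed.sum() / 6)

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2_stat:.3f}, p-value = {p_value:.4f}")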

Chi-squared test of independence (contingency tables)

The chi-squared test of independence evaluates whether two categorical variables are associated. For instance, you might ask if customer gender is independent of whether they respond to a promotional email.

You arrange the counts of observations into a contingency table, compute the expected frequencies under independence, then calculate a chi-squared statistic similarly to the goodness-of-fit formula. If the test statistic is large (p-value $< \alpha$), you conclude that the two variables are not independent.
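
Here is a small sketch of the test of independence with scipy.stats.chi2_contingency; the 2x2 table of counts is made up for illustration.

import numpy as np
from scipy import stats

# Rows = gender, columns = responded / did not respond to the email (illustrative)
table = np.array([[30, 70],
                  [45, 55]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2_stat:.3f}, p-value = {p_value:.4f}, dof = {dof}")
print("expected counts under independence:\n", expected)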

When to use the chi-squared test

Chi-squared tests are suitable for:

  • Categorical data with multiple categories.
  • Testing conformity to an expected theoretical distribution.
  • Checking independence or association between categorical variables (in contingency tables).

They rely on the assumption that expected counts in each cell of the contingency table are sufficiently large (a common rule of thumb is at least 5). For smaller expected counts, alternatives like Fisher's exact test might be more reliable.

Additional considerations: multiple hypothesis testing and power

In data science and machine learning, you may run into scenarios involving many simultaneous hypothesis tests — for example, testing thousands of features for significance at once. Conducting multiple tests inflates the probability of finding false positives. Several advanced methods, such as the Bonferroni correction or the False Discovery Rate (FDR) procedure, address this issue (Benjamini and Hochberg, 1995).
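
A brief sketch of both corrections, assuming the statsmodels library is available; the batch of raw p-values is made up for illustration.

import numpy as np
from statsmodels.stats.multitest import multipletests

# Raw p-values from many simultaneous tests (illustrative values)
p_values = np.array([0.001, 0.008, 0.02, 0.04, 0.05, 0.12, 0.30, 0.45])

# Bonferroni: controls the family-wise error rate
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejections:", reject_bonf)
print("BH (FDR) rejections:  ", reject_bh)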

Another important concept is the power of a test, defined as $1 - \beta$, where $\beta$ is the probability of a Type II error. A test with higher power is better at detecting real effects. Power depends on:

  • Sample size
  • Effect size (i.e., how large the difference or effect is)
  • Significance level ($\alpha$)

Balancing $\alpha$, $\beta$, and sample size is a common challenge in experimental design and can guide practitioners to collect the right amount of data for reliable conclusions.
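
As a sketch of how these quantities trade off in practice, the snippet below uses statsmodels' TTestIndPower (the effect size, power, and $\alpha$ targets are illustrative assumptions) to solve for the required sample size and, conversely, the power achieved at a fixed sample size.

from statsmodels.stats.power import TTestIndPower

# Solve for the sample size per group needed to detect a medium effect
# (Cohen's d = 0.5) with 80% power at alpha = 0.05 (illustrative targets)
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"Required sample size per group: {n_per_group:.1f}")

# Conversely, compute the power achieved with 50 observations per group
power = analysis.power(effect_size=0.5, nobs1=50, alpha=0.05)
print(f"Power with n = 50 per group: {power:.3f}")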

Conclusion

Hypothesis testing provides a formal mechanism to assess whether a given effect or difference is real or can be attributed to chance. Whether you are comparing means with a t-test or assessing independence with a chi-squared test, understanding the conceptual framework — formulating hypotheses, selecting test statistics, and interpreting p-values and confidence intervals — is vital. These tests form the backbone of quantitative research in data science and machine learning, enabling practitioners to draw inferences from complex datasets with measured confidence.

Looking ahead, modern research and industrial applications often involve large-scale or high-dimensional data, leading to challenges in multiple hypothesis testing or ensuring that classical assumptions (normality, independence, etc.) hold. Researchers continue to develop new procedures (e.g., robust testing, permutation-based methods, advanced multiple correction techniques) to keep hypothesis testing both statistically rigorous and practically relevant in the face of big data. By mastering the fundamentals explored here, data scientists can navigate these complexities and make data-driven decisions with confidence.
