

🎓 10/167
This post is a part of the Mathematics educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while the order in Research can be arbitrary.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary material. Stay tuned!
Other types of tests
Non-parametric tests
Non-parametric tests are valuable when you cannot make strong assumptions about the distribution of your data. They are often used when sample sizes are small, data are ordinal, or the data do not meet typical normality or homoscedasticity assumptions. Below are two commonly used non-parametric tests:
- Mann-Whitney U test: Compares two independent groups to determine whether there is a difference in their population medians. This test is the non-parametric counterpart to the two-sample t-test.
The Mann-Whitney U statistic can be conceptualized as measuring the number of "wins" between two samples. Let sample $X$ have size $n_1$ and sample $Y$ have size $n_2$. One way to define the U statistic for sample $X$ is:
$$U_1 = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \mathbf{1}(X_i > Y_j),$$
where $\mathbf{1}(X_i > Y_j)$ is an indicator function that is 1 if $X_i > Y_j$ and 0 otherwise. The final test statistic is then compared to a reference distribution to compute a p-value.
- Wilcoxon signed-rank test: A paired non-parametric test. It checks whether the median of differences between paired observations is zero. It serves as the non-parametric counterpart to the paired t-test.
Both of these tests rank the data rather than using the raw values, which makes them robust to outliers. They are frequently used in applied machine learning contexts when dealing with limited sample sizes or non-Gaussian data distributions (Smith et al., JMLR 2023).
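As a minimal sketch of how both tests look in practice, here is an illustration using SciPy; the samples, sizes, and distributions below are made up purely for demonstration:
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
# Two small, independent samples (illustrative data only).
group_a = rng.exponential(scale=1.0, size=20)
group_b = rng.exponential(scale=1.5, size=20)
# Mann-Whitney U test for two independent samples.
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mw:.3f}")
# Wilcoxon signed-rank test for paired observations,
# e.g., the same subjects measured before and after a change.
before = rng.normal(loc=0.0, scale=1.0, size=20)
after = before + rng.normal(loc=0.3, scale=0.5, size=20)
w_stat, p_w = stats.wilcoxon(before, after)
print(f"Wilcoxon W = {w_stat:.1f}, p = {p_w:.3f}")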
Z-tests and when they are appropriate
A z-test is used to test hypotheses about a population mean or proportion when the population variance is known or when the sample size is sufficiently large. The test statistic for a one-sample z-test is:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
Here:
- $\bar{x}$ is the sample mean,
- $\mu_0$ is the hypothesized population mean,
- $\sigma$ is the (known) population standard deviation,
- $n$ is the sample size.
Because the population standard deviation $\sigma$ is typically unknown in practice, real-world data scientists often default to a t-test (which uses the sample standard deviation $s$ as an estimate for $\sigma$). Z-tests remain important when working with very large datasets (where the sample standard deviation converges to the true standard deviation) or in certain industrial/quality control scenarios where the population variance is historically established.
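As a quick sketch of the formula above, here is a one-sample z-test computed directly with NumPy and SciPy; the measurements, hypothesized mean, and "known" $\sigma$ are all invented numbers for a hypothetical quality-control scenario:
import numpy as np
from scipy.stats import norm
# Hypothetical measurements; sigma assumed known from historical records.
x = np.array([50.2, 49.8, 50.5, 50.1, 49.9, 50.3, 50.4, 50.0])
mu_0 = 50.0    # hypothesized population mean
sigma = 0.4    # known population standard deviation (assumed)
n = len(x)
z = (x.mean() - mu_0) / (sigma / np.sqrt(n))
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value
print(f"z = {z:.3f}, p = {p_value:.3f}")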
F-tests (overview and relation to ANOVA)
An F-test assesses whether the variances of two populations are equal. More generally, it can evaluate whether multiple models differ significantly in explaining variance in a dataset. The test statistic is the ratio of two variance estimates:
$$F = \frac{s_1^2}{s_2^2},$$
where $s_1^2$ and $s_2^2$ are the sample variance estimates of two groups or model residuals. The F-distribution is foundational to analysis of variance (ANOVA), which compares means across more than two groups by partitioning total variance into "explained" vs. "unexplained" components. F-tests are also used in regression analysis (e.g., to test the overall significance of a regression model).
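A minimal sketch of the two-sample variance-ratio F-test on synthetic data follows; SciPy does not expose this particular test as a single function, so the statistic and a two-sided p-value are computed by hand from the F-distribution:
import numpy as np
from scipy.stats import f
rng = np.random.default_rng(1)
sample_1 = rng.normal(loc=0.0, scale=1.0, size=30)
sample_2 = rng.normal(loc=0.0, scale=1.5, size=25)
s1_sq = sample_1.var(ddof=1)  # unbiased sample variances
s2_sq = sample_2.var(ddof=1)
F = s1_sq / s2_sq
df1, df2 = len(sample_1) - 1, len(sample_2) - 1
p_value = 2 * min(f.cdf(F, df1, df2), 1 - f.cdf(F, df1, df2))  # two-sided
print(f"F = {F:.3f}, p = {p_value:.3f}")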
Special considerations for less common tests
Beyond the common tests, you might encounter specialized hypothesis tests for particular data structures. For instance:
- Levene's test and Brown-Forsythe test check homogeneity of variances under less restrictive assumptions than a standard F-test.
- Kruskal–Wallis test extends the Mann-Whitney U test to more than two groups.
- Friedman test extends the Wilcoxon signed-rank concept to more than two groups or repeated measures.
When dealing with high-dimensional data (common in deep learning or bioinformatics), specialized tests with dimension-reduction or bootstrap strategies may be necessary (Doe et al., NeurIPS 2024). Always consider distribution assumptions and data structure before choosing a statistical test.
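For reference, the three tests listed above are all available in SciPy. Here is a brief sketch on synthetic groups; for the Friedman test, the i-th value of each group is treated as coming from the same subject or block:
import numpy as np
from scipy import stats
rng = np.random.default_rng(2)
g1, g2, g3 = (rng.normal(loc=m, scale=1.0, size=15) for m in (0.0, 0.3, 0.6))
print(stats.levene(g1, g2, g3))             # homogeneity of variances (default center='median' is the Brown-Forsythe variant)
print(stats.kruskal(g1, g2, g3))            # Kruskal-Wallis H test
print(stats.friedmanchisquare(g1, g2, g3))  # Friedman test for repeated measures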
ANOVA
One-way ANOVA
One-way ANOVA is used to determine whether three or more independent groups differ in their means. It partitions the total variability of the data into "between-groups" and "within-groups" variability, creating an F-statistic:
$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$
Where:
- $MS_{\text{between}}$ is the mean square of the variance explained by the factor (group differences),
- $MS_{\text{within}}$ is the mean square of the residual (unexplained) variance, i.e., within-group variability.
If the F-statistic is significantly large, it suggests that at least one group mean differs from the others. However, ANOVA does not by itself tell you which groups differ — that requires a post-hoc test (discussed later).
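A minimal sketch of a one-way ANOVA in Python, using SciPy's f_oneway on three synthetic groups (the accuracy-like numbers are invented for illustration):
import numpy as np
from scipy.stats import f_oneway
rng = np.random.default_rng(3)
# Three independent groups with slightly different true means (synthetic data).
group_a = rng.normal(loc=0.70, scale=0.05, size=20)
group_b = rng.normal(loc=0.72, scale=0.05, size=20)
group_c = rng.normal(loc=0.75, scale=0.05, size=20)
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")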
Two-way ANOVA
Two-way ANOVA extends the idea of one-way ANOVA by allowing you to study the effect of two different factors on a response variable simultaneously. For example, you might examine how both "type of fertilizer" and "temperature condition" affect plant growth. Two-way ANOVA also enables you to assess possible interaction effects between factors:
- Main effect of Factor A
- Main effect of Factor B
- Interaction effect (A × B)
In data science experiments, two-way ANOVA is helpful for understanding how multiple experimental conditions combine to affect performance metrics (e.g., how both the optimizer choice, such as SGD vs. Adam, and the batch size might influence the accuracy of a neural network).
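One common way to fit a two-way ANOVA in Python is via statsmodels' formula API. The sketch below uses an invented dataset with hypothetical columns accuracy, optimizer, and batch_size; anova_lm then reports both main effects and the interaction:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
rng = np.random.default_rng(4)
# Synthetic experiment: accuracy measured under optimizer x batch-size conditions.
optimizer = np.repeat(["sgd", "adam"], 40)
batch_size = np.tile(np.repeat(["32", "256"], 20), 2)
accuracy = (0.80 + 0.02 * (optimizer == "adam")
            - 0.01 * (batch_size == "256")
            + rng.normal(scale=0.01, size=80))
df = pd.DataFrame({"accuracy": accuracy, "optimizer": optimizer, "batch_size": batch_size})
model = smf.ols("accuracy ~ C(optimizer) * C(batch_size)", data=df).fit()
print(anova_lm(model, typ=2))  # main effects of each factor plus their interaction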
Repeated measures ANOVA
When the same subjects (or experimental units) are measured under multiple conditions or at multiple time points, a repeated measures ANOVA is used. It accounts for the correlation between repeated measurements on the same subject. This design drastically reduces the impact of individual differences, often making it more powerful than a comparable between-subjects (independent) design.
In practice, repeated measures ANOVA is particularly relevant in scenarios like:
- Tracking model performance across multiple iterations or hyperparameter settings on the exact same dataset partitions.
- Biometric measurements on the same individual over time (medical or user-behavior studies).
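statsmodels provides AnovaRM for balanced repeated-measures designs; the following is a minimal sketch on a fabricated dataset where 10 subjects are each measured under three conditions:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
rng = np.random.default_rng(5)
# 10 subjects, each measured once under each of 3 conditions (balanced design).
subject = np.repeat(np.arange(10), 3)
condition = np.tile(["c1", "c2", "c3"], 10)
score = rng.normal(loc=0.0, scale=1.0, size=30) + np.tile([0.0, 0.2, 0.4], 10)
df = pd.DataFrame({"subject": subject, "condition": condition, "score": score})
res = AnovaRM(df, depvar="score", subject="subject", within=["condition"]).fit()
print(res)  # F-test for the within-subject factor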
Post-hoc tests
Post-hoc tests help identify which specific group means are different after finding a significant F-statistic in an ANOVA. Common post-hoc tests include:
- Tukey's honest significant difference (HSD): Designed specifically for comparing all pairs of means, controlling the family-wise error rate.
- Bonferroni correction: Adjusts p-values by multiplying them by the number of comparisons. It's a conservative correction, reducing Type I errors but possibly increasing Type II errors.
- Scheffé test: More flexible approach for complex comparisons, generally more conservative than Tukey's method.
In machine learning and data science contexts, you might rely on these tests when comparing multiple models or configurations in a single experiment, ensuring that your results remain statistically sound despite multiple comparisons.
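As a sketch, Tukey's HSD is available in statsmodels; here it compares three hypothetical model configurations on synthetic score data after an ANOVA has flagged a difference:
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd
rng = np.random.default_rng(6)
# Synthetic scores for three model configurations.
scores = np.concatenate([rng.normal(loc=m, scale=0.05, size=20) for m in (0.70, 0.72, 0.75)])
labels = np.repeat(["model_a", "model_b", "model_c"], 20)
result = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(result.summary())  # pairwise mean differences with family-wise error control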
A/B testing (basics)
Designing and structuring an A/B test
A/B testing (also called split testing) is commonly used to compare two different variants (A vs. B) of an online user experience (e.g., a webpage, an interface design) or different model treatments in production. Designing a proper A/B test involves:
- Defining a clear hypothesis and success metric (for example, click-through rate or user retention).
- Randomly assigning subjects (e.g., users, sessions) to control (A) or treatment (B).
- Running the experiment for sufficient duration to collect representative data.
- Analyzing the results using an appropriate statistical test (often a t-test or z-test, depending on assumptions and sample sizes).
It's also common to pre-register the analysis plan to avoid p-hacking or selective reporting of metrics.
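For a conversion-style success metric, the analysis step often boils down to a two-proportion z-test. A sketch with made-up counts, using statsmodels:
from statsmodels.stats.proportion import proportions_ztest
# Hypothetical results: conversions out of total users for variants A and B.
conversions = [520, 570]
users = [10_000, 10_000]
z_stat, p_value = proportions_ztest(count=conversions, nobs=users)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")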
Statistical significance and p-hacking pitfalls
One major pitfall in A/B testing is "peeking" at intermediate results and stopping when you see a significant difference. This inflates the Type I error rate because the more times you test, the higher the chance of falsely rejecting the null hypothesis at least once. Another pitfall is testing multiple metrics but only reporting the one(s) that showed significance. This is known as p-hacking or "data dredging."
To address these issues:
- Pre-specify the stopping rule and test plan.
- Use corrected significance thresholds (e.g., alpha spending methods) or sequential testing protocols like group sequential designs.
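When several metrics are tested in the same experiment, adjusting the p-values before declaring winners is a simple safeguard; a sketch with illustrative p-values using statsmodels' multipletests:
from statsmodels.stats.multitest import multipletests
# Hypothetical raw p-values for several metrics from one A/B test.
p_values = [0.012, 0.048, 0.20, 0.003]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print("adjusted p-values:", p_adjusted)
print("reject null hypothesis:", reject)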
Practical considerations and real-world examples
In production-level machine learning systems, A/B testing might involve:
- Recommender systems evaluating different ranking algorithms.
- Advertising platforms testing new bidding strategies.
- E-commerce sites examining alternative product recommendation carousels.
Because user behavior changes over time, it's critical to run the test for a continuous period during which external factors (e.g., seasonality) can be controlled or balanced. In some advanced setups, multi-armed bandit algorithms can adaptively allocate traffic to the best-performing variant (Cao and Freedman, NeurIPS 2023). However, the fundamental idea of hypothesis testing remains at the core.
Power analysis
Concept of statistical power
Statistical power is the probability that a test correctly rejects the null hypothesis when the alternative hypothesis is true. Formally,
$$\text{Power} = 1 - \beta,$$
where $\beta$ is the probability of a Type II error (failing to reject a false null). High power (often targeted at 80% or 90%) is desired to detect an effect if it actually exists.
Factors affecting power (sample size, effect size, alpha level)
Three main factors determine statistical power:
- Sample size: Larger sample sizes generally improve power because they reduce the standard error of estimates.
- Effect size: A bigger true difference (or effect) is easier to detect. Smaller effects need more data.
- Significance level (alpha): If you require a smaller alpha (e.g., 0.01 vs. 0.05), power decreases for a given sample size and effect size.
In ML contexts, effect size can relate to improvement in accuracy, F1-scores, or other performance metrics. Gathering enough data to ensure adequate power is often a logistical or financial challenge.
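The interplay of these three factors is easy to see numerically. The sketch below uses statsmodels' TTestIndPower to tabulate power for a two-sample t-test over a few illustrative effect sizes and per-group sample sizes:
from statsmodels.stats.power import TTestIndPower
analysis = TTestIndPower()
for effect_size in (0.2, 0.5, 0.8):        # small, medium, large (Cohen's d)
    for n_per_group in (20, 50, 100):
        power = analysis.power(effect_size=effect_size, nobs1=n_per_group,
                               alpha=0.05, ratio=1.0, alternative='two-sided')
        print(f"d = {effect_size}, n = {n_per_group}: power = {power:.2f}")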
Performing a power analysis and determining sample size
A power analysis can be done:
- A priori (before collecting data) to estimate how large your sample must be to reliably detect an expected effect size.
- Post hoc (after an experiment) to assess whether your test had sufficient sensitivity.
In Python, you can use libraries like statsmodels to perform power analyses. For example:
import math
from statsmodels.stats.power import TTestIndPower
# Suppose we want to detect a difference of d = 0.5 (Cohen's d)
# with alpha = 0.05, power = 0.8 (80%), two-sided test,
# in a two-sample (independent groups) t-test.
effect_size = 0.5
alpha = 0.05
power = 0.8
analysis = TTestIndPower()
required_n = analysis.solve_power(effect_size=effect_size,
                                  alpha=alpha,
                                  power=power,
                                  ratio=1.0,
                                  alternative='two-sided')
print("Required sample size per group:", math.ceil(required_n))
This snippet calculates the necessary sample size per group in a two-sample t-test scenario. In real scenarios, you would refine these parameters (e.g., effect size, alpha) based on domain-specific knowledge and practical constraints.
More advanced topics
Even within classical hypothesis testing, numerous refinements and extensions exist. Some areas you may encounter in advanced research or specialized applications include:
- Bayesian approaches to hypothesis testing, which provide posterior probabilities of hypotheses rather than simple reject/do-not-reject decisions.
- Multiple testing corrections in high-dimensional biology or multi-experiment ML pipelines (Benjamini–Hochberg, Holm–Bonferroni, etc.).
- Permutation tests and bootstrap methods for robust inference when distribution assumptions are questionable or sample sizes are small.
- Sequential analysis, where you continuously monitor performance metrics during model deployment and adapt decisions in real time.
These advanced methods are frequently published in venues like NeurIPS, ICML, and JMLR, reflecting the ongoing research into robust, data-driven statistical practices that align with modern AI workflows (Zhang et al., ICML 2025).
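Of these, permutation tests are the most straightforward to implement from scratch. The sketch below runs a permutation test for a difference in means on synthetic data; the group sizes and number of permutations are arbitrary choices for illustration:
import numpy as np
rng = np.random.default_rng(7)
a = rng.normal(loc=0.0, scale=1.0, size=30)
b = rng.normal(loc=0.4, scale=1.0, size=30)
observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])
n_perm, count = 10_000, 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)  # reassign group labels at random
    diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
    if abs(diff) >= abs(observed):      # at least as extreme as the observed difference
        count += 1
print(f"permutation p-value: {count / n_perm:.4f}")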
Putting it all together
In this second part of our exploration of hypothesis testing, we covered a broader repertoire of tests and dived into analysis of variance, A/B testing considerations, and the vital role of power analysis. As you continue to expand your skills, keep in mind:
- Always match the test to your data assumptions (normality, variance homogeneity, independence).
- Use post-hoc procedures responsibly to pinpoint group differences while limiting false discoveries.
- Guard against p-hacking by pre-registering and using proper multiple-comparison corrections or sequential designs.
- Ensure your test is sufficiently powered before launching critical experiments, especially in production systems.
For more details and continuous learning, consider:
- Textbooks on experimental design and statistical methods (e.g., Montgomery's "Design and Analysis of Experiments").
- Online resources on Bayesian inference and robust statistics.
- Research papers from top ML and statistics conferences exploring cutting-edge approaches to hypothesis testing in high-dimensional and non-traditional data settings.
Mastering these fundamentals will strengthen your data-driven decision-making in scientific research, product experimentation, and advanced machine learning system development.