A non-boring intro to statistics, pt. 2
The science of lies
#️⃣  Mathematics ⌛  ~50 min 🗿  Beginner
23.08.2022
#9

This post is part of the Mathematics educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while the order in Research can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!

Statistically, a liar is believed more if he cites statistics.


Sampling

Let's get a bit closer to machine learning.

Sampling is the process of selecting a subset of data points or observations from a larger population. This is done so we can analyze the subset and make reasonable inferences about the entire population — without having to collect or observe every possible data point. In data science and machine learning, sampling is pervasive: from building training sets to constructing test sets, we rely on carefully chosen samples to train and evaluate our models. When done correctly, sampling can save enormous resources (time, cost, computational effort) while preserving the essential characteristics of the overall population.

Sampling methods and strategies

Sampling methods can be broadly divided into probability-based and non-probability-based strategies:

  1. Probability-based sampling: Each member of the population has a known (and typically non-zero) probability of being selected.

    • Simple random sampling: Every individual in the population has an equal chance of being selected. This is often done by assigning random numbers to all members of the population and picking those with the smallest (or largest) numbers, or by using randomizing functions in software.
    • Systematic sampling: You pick a random start point and then select every k-th item, where k is determined by dividing the population size by the desired sample size.
    • Stratified sampling: The population is divided into homogeneous subgroups, or strata (e.g., by age group, region, or income bracket). You then randomly select individuals from each stratum in proportion to that stratum's size in the population.
    • Cluster sampling: The population is split into clusters (e.g., geographical areas, organizational units). A subset of clusters is then selected randomly, and within each chosen cluster, you either collect data from every member or again select a random subset.
  2. Non-probability-based sampling:

    • Convenience sampling: Selecting participants or data points based on their easy availability (e.g., surveying people in a mall).
    • Quota sampling: Ensuring that the sample meets certain quotas for predefined categories (e.g., 40% female, 60% male) but otherwise using a non-random approach within each category.
    • Snowball sampling: Typically used when the population is hard to reach or hidden. Existing participants recruit future ones (e.g., surveying members of niche online forums).

Each approach has trade-offs. Probability-based methods typically allow for more rigorous statistical inference (e.g., confidence intervals, error bounds), whereas non-probability methods are sometimes faster or more practical in real-world settings.
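
To make the probability-based strategies above concrete, here is a minimal sketch (with a small made-up population carrying a hypothetical age-group label) of simple random, systematic, and stratified sampling in NumPy:

import numpy as np

rng = np.random.default_rng(42)

# A toy population of 1,000 individuals with a hypothetical age-group label
population = np.arange(1000)
age_group = rng.choice(["18-30", "31-50", "51+"], size=1000, p=[0.3, 0.5, 0.2])

# Simple random sampling: every individual has an equal chance of selection
simple_sample = rng.choice(population, size=100, replace=False)

# Systematic sampling: random start, then every k-th individual
k = len(population) // 100
start = rng.integers(0, k)
systematic_sample = population[start::k]

# Stratified sampling: sample within each stratum in proportion to its size
stratified_sample = []
for group in np.unique(age_group):
    members = population[age_group == group]
    n_take = round(100 * len(members) / len(population))
    stratified_sample.extend(rng.choice(members, size=n_take, replace=False))

print(len(simple_sample), len(systematic_sample), len(stratified_sample))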

Importance of representative samples

A representative sample is one that captures the essential characteristics of the broader population. If the sample systematically overrepresents or underrepresents certain features, any statistical inference from that sample could be biased or misleading. For instance, if you are polling political opinions and only sample individuals who frequently use social media, your results might not represent those who rarely use the internet — leading to skewed conclusions.

Importance of sampling in statistics

Sampling underpins almost every statistical procedure. Many foundational techniques — like hypothesis testing, confidence interval construction, or regression analysis — depend on the assumption that the data analyzed are drawn from a representative sample. If this assumption is violated, our estimates of the population parameters (e.g., mean, variance) may be inaccurate, making all subsequent analyses questionable.

Moreover, in machine learning, we often split a dataset into training, validation, and test sets. Doing so correctly relies on sound sampling strategies that preserve overall class distributions (in classification) or other important characteristics. This helps ensure that model performance metrics generalize to the real-world data distribution.
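
For example, scikit-learn's train_test_split can stratify a split by the label vector so that class proportions are preserved; here is a minimal sketch on made-up imbalanced data:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)   # ~10% positive class (hypothetical)

# stratify=y keeps the class proportions roughly equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train positives:", y_train.mean(), "Test positives:", y_test.mean())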

Types of sampling

Here are the key types in concise form:

  • Random (simple random sampling): Every member of the population has an equal probability of being included.
  • Stratified sampling: The population is split by known characteristics (strata), and a random sample is taken within each group to ensure representation.
  • Cluster sampling: Natural groupings in the population serve as "clusters." A random set of clusters is chosen, and data are collected from within each selected cluster.
  • Systematic sampling: A fixed, periodic interval is used after a random start.

Sampling bias and how to avoid it

Sampling bias occurs when some members of the population are more likely to be chosen than others, distorting inferences about the population. Common causes include:

  • Selection bias: The way participants are selected (e.g., using only volunteers) is not representative.
  • Non-response bias: Individuals who choose not to respond differ systematically from those who do.
  • Undercoverage: Important segments of the population are insufficiently included (e.g., not having phone numbers for rural households).

To mitigate sampling bias:

  • Use random selection whenever possible.
  • Compare sample demographics to known population demographics (if available) to detect imbalances.
  • Employ techniques like weighting to account for underrepresented groups (see the sketch after this list).
  • Ensure clear and accessible data-collection processes that minimize barriers to participation.
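
As a minimal sketch of the weighting idea (with made-up census and sample shares), post-stratification weights can be computed as the ratio of population share to sample share:

# Hypothetical shares: the sample over-represents urban respondents
population_share = {"urban": 0.55, "rural": 0.45}   # assumed known from a census
sample_share = {"urban": 0.80, "rural": 0.20}       # observed in our sample

# Post-stratification weight = population share / sample share
weights = {g: population_share[g] / sample_share[g] for g in population_share}
print(weights)   # rural respondents are up-weighted, urban ones down-weighted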

Reservoir Sampling

Reservoir sampling is a clever technique to draw a fixed-size random sample from a potentially large or unknown-size stream of data:

  1. Fill the "reservoir" array of size k with the first k items.
  2. For each item i (counting from k+1 to n) in the stream:
    • Generate a random integer r in [1, i].
    • If r ≤ k, replace the item at index r in the reservoir with the new item i.

This ensures that, after processing all n items, each item has an equal probability (k/n) of appearing in the reservoir. It is especially useful for streaming data or extremely large datasets where storing the entire data in memory is impractical. (The code below uses 0-based indices, so the condition becomes r < k.)

import random

def reservoir_sampling(stream, k):
    """Return k uniformly random items from a stream."""
    reservoir = []
    
    # Fill reservoir with first k items
    for i in range(k):
        reservoir.append(stream[i])
    
    # Replace elements with gradually decreasing probability
    index = k
    while index < len(stream):
        # Random integer in [0, index]
        r = random.randint(0, index)
        if r < k:
            reservoir[r] = stream[index]
        index += 1
    
    return reservoir

# Example usage
data_stream = list(range(1, 10001))
sampled = reservoir_sampling(data_stream, 5)
print("Random sample from a large stream:", sampled)

Handling Imbalanced Classes with SMOTE

For imbalanced classification problems (e.g., fraud detection, where the fraudulent cases are rare), standard random sampling often leads to models that overlook minority classes. SMOTE (Synthetic Minority Over-sampling Technique) addresses this by generating synthetic samples of the minority class rather than merely duplicating existing points.

  • Step 1: For each minority class instance, find its k-nearest neighbors in the minority class.
  • Step 2: Randomly select one of those neighbors and generate a synthetic point along the line joining the two samples.

This way, the minority class distribution is augmented in feature space, reducing the risk of overfitting and helping the model learn decision boundaries more effectively.
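
A minimal sketch, assuming the imbalanced-learn package (imblearn) is installed, of applying SMOTE to a synthetic imbalanced dataset:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic dataset with ~5% minority class
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
print("Before:", Counter(y))

# SMOTE interpolates between minority samples and their nearest minority neighbors
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("After:", Counter(y_resampled))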

[Image placeholder: SMOTE illustration. Caption: Synthetic samples are generated along the line segments between a minority sample and its neighbors.]

Weighted Sampling

Weighted sampling assigns different probabilities to different items based on importance or cost. In certain real-world scenarios, not all data points are equally informative or equally likely. Weighted sampling ensures that items of higher importance or underrepresented subpopulations are more likely to be included in the sample.
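
A minimal sketch with made-up importance scores, using NumPy's weighted choice:

import numpy as np

rng = np.random.default_rng(7)

items = np.array(["common_event", "rare_event", "very_rare_event"])
importance = np.array([1.0, 5.0, 20.0])          # hypothetical importance scores
probabilities = importance / importance.sum()    # normalize into a distribution

# Rare but important items are now much more likely to enter the sample
weighted_sample = rng.choice(items, size=10, replace=True, p=probabilities)
print(weighted_sample)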

Bootstrap Sampling

Although widely known for forming confidence intervals (via the bootstrap method), bootstrap sampling can also be used to evaluate the stability or variability of statistical estimates. In this approach, you resample (with replacement) from your original dataset to create many "bootstrapped" datasets of the same size as the original. Each bootstrapped dataset provides a slightly different estimate of your parameter (e.g., mean, correlation), allowing you to assess how much that estimate fluctuates.
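
As a quick sketch of the variability idea (a full bootstrap confidence interval appears later in this post), we can look at how much a statistic, say the median, fluctuates across bootstrapped datasets:

import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=500)   # a skewed toy dataset

# Resample with replacement and track how the median fluctuates
boot_medians = [
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(2000)
]
print("Bootstrap estimate of the median's standard error:", np.std(boot_medians))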

Exploring relationships between variables

Analyzing relationships between variables is central to understanding patterns in data. Correlation and covariance are two fundamental ways to measure how changes in one variable are associated with changes in another.

Correlation: positive, negative, and zero correlations

Correlation measures the strength and direction of a linear relationship between two variables. The most common measure is the Pearson correlation coefficient, ρ (for the population) or r (for a sample):

r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}.
  • Positive correlation: As one variable increases, the other tends to increase (e.g., height and weight).
  • Negative correlation: As one variable increases, the other tends to decrease (e.g., the time spent studying might negatively correlate with error rates on a test).
  • Zero (or near-zero) correlation: No linear relationship is observed (e.g., daily temperature vs. the number of letters in your name).

A value of r = 1 indicates a perfect positive linear relationship, r = -1 indicates a perfect negative linear relationship, and r = 0 indicates no linear relationship. Real-world data rarely show perfect linearity, but even moderate correlations can be significant in certain contexts.

Covariance

Covariance is a measure of how two variables vary together. It is given by:

\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])].

In practice, we often estimate it using the sample covariance:

\hat{\mathrm{Cov}}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).
  • If Cov(X, Y) > 0, X and Y tend to move in the same direction.
  • If Cov(X, Y) < 0, X and Y tend to move in opposite directions.
  • If Cov(X, Y) = 0, there is no linear association.

Covariance is not standardized and depends on the units of the variables. By normalizing covariance (dividing by the standard deviations of both variables), you get the correlation coefficient.
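
A minimal sketch with made-up height/weight data, showing that the sample covariance divided by both standard deviations reproduces Pearson's r:

import numpy as np

rng = np.random.default_rng(3)
height = rng.normal(170, 10, size=500)
weight = 0.9 * height + rng.normal(0, 8, size=500)   # positively related by construction

cov_xy = np.cov(height, weight)[0, 1]        # sample covariance (1/(n-1) convention)
r_xy = np.corrcoef(height, weight)[0, 1]     # Pearson correlation

print("Cov(height, weight):", cov_xy)
print("r(height, weight):  ", r_xy)
print("Cov / (s_x * s_y):  ", cov_xy / (np.std(height, ddof=1) * np.std(weight, ddof=1)))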

Why "correlation is not causation"

It is a common misconception to interpret correlation as evidence of a cause-and-effect relationship. A high correlation might be due to:

  • Confounding factors: A third variable influences both X and Y (e.g., "ice cream sales" and "drowning incidents" both increase in hot weather).
  • Reverse causality: Instead of X causing Y, it might be that Y causes X.
  • Coincidence: Sometimes variables correlate purely by chance in finite samples.

In practice, establishing causality typically requires controlled experiments (e.g., randomized controlled trials) or robust observational study designs. Advanced methods such as causal graphs (e.g., Judea Pearl's work) or quasi-experimental designs (e.g., difference-in-differences) can also be used.

Spearman's Rank Correlation

Spearman's rank correlation coefficient is a non-parametric measure of correlation that assesses how well the relationship between two variables can be described using a monotonic function. Unlike Pearson correlation, it relies on the rank of the data rather than the actual numeric values, making it more robust to outliers or non-linear relationships.

Formally, if R(x_i) is the rank of x_i and R(y_i) is the rank of y_i:

\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)},

where d_i = R(x_i) - R(y_i). (This simplified formula assumes there are no tied ranks.)

Kendall's Tau

Kendall's tau is another rank-based measure. It counts the number of concordant vs. discordant pairs in the data. While computationally slightly more involved, Kendall's tau can sometimes be more sensitive in capturing the strength of a monotonic relationship than Spearman's correlation.
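
A minimal sketch comparing the three coefficients on made-up data with a monotonic but non-linear relationship, using SciPy:

import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(5)
x = rng.uniform(0, 5, size=300)
y = np.exp(x) + rng.normal(0, 5, size=300)   # monotonic but clearly non-linear

r, _ = pearsonr(x, y)        # sensitive to the non-linearity
rho, _ = spearmanr(x, y)     # rank-based, close to 1 here
tau, _ = kendalltau(x, y)    # based on concordant vs. discordant pairs

print(f"Pearson r: {r:.3f}, Spearman rho: {rho:.3f}, Kendall tau: {tau:.3f}")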

Partial Correlation

A partial correlation measures the relationship between two variables while controlling for the effect of one or more additional variables. This is particularly useful in multivariate analyses where the direct relationship between two variables might be confounded by a third.

For instance, if you want to see how "exercise frequency" (X) correlates with "blood pressure" (Y) while removing the effect of "age" (Z), partial correlation helps filter out the influence that age might have on both exercise frequency and blood pressure.

import numpy as np

def partial_correlation(X):
    """
    Compute the partial correlation matrix of X.
    Each column of X is a variable.
    """
    # Invert the covariance matrix
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.inv(cov)

    # Normalize to get partial correlation
    d = np.sqrt(np.diag(inv_cov))
    P = -inv_cov / np.outer(d, d)
    
    # Diagonal entries will be 1 (correlation with itself)
    np.fill_diagonal(P, 1)
    return P

# Example usage
# Suppose we have columns: [exercise_frequency, blood_pressure, age, ...]
data = np.random.rand(100, 3)
pcorr = partial_correlation(data)
print("Partial correlation matrix:\n", pcorr)

The likelihood function

Definition and role in statistical modeling, parameter estimation

In statistical modeling, the likelihood function expresses how likely the observed data are, given a set of parameters and an assumed model. Formally, if we have observations x_1, x_2, \dots, x_n and a statistical model with parameter θ (which could be a scalar or vector), the likelihood is:

\mathcal{L}(\theta) = p(x_1, x_2, \dots, x_n \mid \theta),

where p(· | θ) denotes the probability (or probability density) of the data under the parameter θ. In practice, we often work with the log-likelihood:

\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^n \log p(x_i \mid \theta).

Maximizing the log-likelihood (MLE: maximum likelihood estimation) is usually more convenient than maximizing L(θ) directly, thanks to the property that the logarithm is a strictly increasing function. This leads to simpler sums (rather than products).
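
A minimal sketch of MLE for a Gaussian: numerically maximizing the log-likelihood (by minimizing its negative) and comparing against the closed-form estimates:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Negative log-likelihood of a Gaussian, parameterized as theta = (mu, log_sigma)
def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print("Numerical MLE:", mu_hat, sigma_hat)
print("Closed form:  ", data.mean(), data.std())   # matches the analytical MLE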

Relationship to probability and inference; examples of likelihood in machine learning models

The likelihood is closely tied to the idea of fitting a parameterized probability distribution to observed data. In parametric machine learning, consider a logistic regression model:

P(y = 1 \mid x; \theta) = \sigma(\theta^\top x),

where σ(·) is the sigmoid function. We can interpret θ as the parameters that maximize the likelihood of the observed labels y. This concept generalizes to many models: from Gaussian mixture models to neural networks trained by minimizing cross-entropy (which can be derived as a negative log-likelihood).

Likelihood-based approaches also form the foundation for Bayesian methods, where we combine likelihood with a prior distribution to obtain a posterior via Bayes' theorem.

MLE and MAP

We introduced maximum likelihood estimation (MLE) as a strategy to choose parameters θ that maximize p(x | θ). However, in a Bayesian framework, we incorporate a prior distribution p(θ) over the parameters and update this belief in light of new data. This yields the posterior distribution:

p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta).
  • MLE: \hat{\theta}_{\text{MLE}} = \mathrm{argmax}_{\theta} \, p(x \mid \theta).
  • MAP (Maximum A Posteriori): \hat{\theta}_{\text{MAP}} = \mathrm{argmax}_{\theta} \, p(\theta \mid x) = \mathrm{argmax}_{\theta} \, p(x \mid \theta)\, p(\theta).

The MAP estimator effectively combines observed data (likelihood) with prior knowledge or beliefs (the prior). When the prior is uninformative (uniform), MAP and MLE coincide. But with informative priors, MAP often yields parameter estimates that incorporate domain expertise or regularization-like effects.
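
A minimal sketch for a Bernoulli parameter (a coin-flip probability) with a Beta prior, where both estimators have closed forms:

import numpy as np

rng = np.random.default_rng(2)

# 20 flips of a hypothetical coin with true P(heads) = 0.7
flips = rng.random(20) < 0.7
heads, n = flips.sum(), len(flips)

# MLE of a Bernoulli parameter: the sample proportion
theta_mle = heads / n

# MAP with a Beta(a, b) prior: the posterior mode has a closed form
a, b = 2, 2                                    # mildly informative prior centered at 0.5
theta_map = (heads + a - 1) / (n + a + b - 2)

print("MLE:", theta_mle, "MAP:", theta_map)    # MAP is pulled toward the prior's center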

Confidence intervals

Definition of confidence interval, constructing confidence intervals

A confidence interval (CI) gives a range of plausible values for a population parameter based on sample data. A 95% confidence interval, for example, is often constructed so that if we were to repeat the sampling process many times, about 95% of such intervals would contain the true parameter value. One standard construction for a mean with a large sample size (and an approximately normal sample mean) is:

\bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}},

where:

  • \bar{x} is the sample mean.
  • s is the sample standard deviation.
  • n is the sample size.
  • z_{\alpha/2} is the critical value from the standard normal distribution (e.g., 1.96 for 95% confidence).
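
A minimal sketch computing this interval for a made-up sample, using SciPy for the critical value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(loc=10, scale=3, size=200)

x_bar = sample.mean()
s = sample.std(ddof=1)
n = len(sample)
z = stats.norm.ppf(0.975)          # ~1.96 for a 95% confidence level

margin = z * s / np.sqrt(n)
print(f"95% CI for the mean: ({x_bar - margin:.3f}, {x_bar + margin:.3f})")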

Interpreting confidence intervals

A frequent misinterpretation is: "There is a 95% probability the true mean lies in this interval." Strictly speaking, in frequentist terms, the true mean is a fixed quantity. The correct interpretation is: "With repeated sampling and interval construction, 95% of those intervals would contain the true mean."

In practice, we use confidence intervals to convey the uncertainty around an estimate. Narrow intervals indicate higher precision (often due to a larger sample size or lower variability), whereas wide intervals indicate less precision.

Bootstrap Confidence Intervals

While a simple confidence interval for the mean often uses a normal approximation, bootstrap confidence intervals are more flexible and can be applied to many statistics (medians, proportions, correlation coefficients, etc.). The procedure is:

  1. Draw a bootstrap sample (of the same size as the original) with replacement.
  2. Compute the statistic of interest (e.g., sample mean) for this bootstrap sample.
  3. Repeat steps 1–2 many times (e.g., 1,000 or 10,000 replicates).
  4. Use the distribution of the bootstrap replicates to form a confidence interval (e.g., the 2.5th and 97.5th percentiles for a 95% CI).
import numpy as np

def bootstrap_ci(data, func=np.mean, alpha=0.05, n_boot=1000):
    """Compute a bootstrap confidence interval for func(data)."""
    stats = []
    n = len(data)
    for _ in range(n_boot):
        sample = np.random.choice(data, size=n, replace=True)
        stats.append(func(sample))
    
    stats = sorted(stats)
    lower_bound = np.percentile(stats, 100*(alpha/2))
    upper_bound = np.percentile(stats, 100*(1 - alpha/2))
    return lower_bound, upper_bound

# Example usage
samples = np.random.randn(1000)
ci = bootstrap_ci(samples, func=np.median)
print(f"Bootstrap 95% CI for the median: {ci}")

Bootstrap methods can be adapted for various estimators and provide more accurate intervals when parametric assumptions (e.g., normality) are questionable.

Quantiles

Quantiles partition data into segments based on rank order. The median is a special case (the 50th percentile). More generally:

  • Quartiles: Split the data into four parts (Q1 is the 25th percentile, Q2 is the 50th, Q3 is the 75th).
  • Percentiles: Split the data into 100 equal parts; the k-th percentile is the value at or below which k% of the observations lie.

Definition (median, quartiles, percentiles, etc.)

Formally, the α-quantile q_α of a distribution is the value such that:

P(X \le q_\alpha) = \alpha.

For empirical data, we often approximate quantiles by sorting observations and finding the point(s) in the sorted list that correspond to the fraction α.

Uses in data analysis

  • Detecting skewness: If the median differs significantly from the mean, it might indicate a skew.
  • Examining outliers: The 5th and 95th (or 1st and 99th) percentiles often highlight the tails of a distribution. Box plots, for instance, depict quartiles and help spot outliers.
  • Robust analysis: Median or other quantiles are less sensitive to outliers than the mean.

Interquantile Ranges

Beyond the standard quartiles (25%, 50%, 75%), analysts often use interquantile ranges (e.g., the 10%–90% range) to focus on the "core" portion of the data. This is particularly informative for skewed or heavy-tailed distributions, where a few extreme observations can dominate the overall range.
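
A minimal sketch on made-up heavy-tailed data, computing quartiles, the interquartile range, and a 10%–90% range with NumPy:

import numpy as np

rng = np.random.default_rng(6)
data = rng.lognormal(mean=0, sigma=1, size=10_000)   # right-skewed toy data

q1, median, q3 = np.quantile(data, [0.25, 0.50, 0.75])
p10, p90 = np.quantile(data, [0.10, 0.90])

print("Quartiles:", q1, median, q3)
print("IQR:", q3 - q1)
print("10%-90% range:", p90 - p10)
print("Mean vs. median:", data.mean(), median)   # the gap hints at the right skew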

Quantile Regression

Quantile regression allows you to model specific quantiles (e.g., the median or 90th percentile) of the response variable rather than the mean. This is extremely useful in fields such as finance or economics, where you might care about worst-case (upper quantile) or best-case (lower quantile) scenarios, not just the average outcome.
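
A minimal sketch, assuming statsmodels is available, fitting median and 90th-percentile regressions on made-up data whose noise grows with x:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=500)
y = 2.0 * x + rng.normal(0, 1 + 0.5 * x, size=500)   # spread increases with x

X = sm.add_constant(x)
median_fit = sm.QuantReg(y, X).fit(q=0.5)
upper_fit = sm.QuantReg(y, X).fit(q=0.9)

print("Median (q=0.5) slope:", median_fit.params[1])
print("Upper (q=0.9) slope: ", upper_fit.params[1])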

Density estimation

Density estimation is about constructing an estimate of the probability density function (pdf) from observed data. Unlike parametric approaches (e.g., fitting a normal distribution), non-parametric density estimation makes fewer assumptions about the shape of the distribution.

Kernel density estimation

A popular non-parametric approach is kernel density estimation (KDE). Given data points x_1, x_2, \dots, x_n, the KDE at a point x is often defined as:

\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right),

where:

  • K is a kernel function (e.g., Gaussian kernel).
  • h is the bandwidth (smoothing parameter).
  • n is the number of data points.

The kernel function K(·) weights the contribution of observations near x. The choice of h greatly influences the smoothness of the estimated density. If h is too large, the estimate is overly smooth (underfitting). If h is too small, the estimate is too "spiky" (overfitting to the sample points).

Practical steps and bandwidth selection

In practice, software libraries (e.g., scikit-learn or seaborn in Python) provide built-in KDE functions with sensible default bandwidth selection methods like Silverman's rule of thumb. Still, it is good to understand the trade-offs:

  1. Rule-of-thumb methods: These are analytical approximations based on the standard deviation of the data and the sample size.
  2. Cross-validation: Treat the bandwidth as a hyperparameter to be tuned by minimizing some error measure (e.g., mean integrated squared error) on a validation set.
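
As a sketch of the cross-validation idea, scikit-learn's KernelDensity can be tuned with GridSearchCV, which scores candidate bandwidths by the held-out log-likelihood:

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(9)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)]).reshape(-1, 1)

# Treat the bandwidth as a hyperparameter; the CV score is the held-out log-likelihood
grid = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.logspace(-1, 0.5, 20)},
    cv=5,
)
grid.fit(data)
print("Best bandwidth:", grid.best_params_["bandwidth"])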

Below is a simplified Python snippet illustrating KDE using seaborn:


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate some random data from a mixture of Gaussians
data_part1 = np.random.normal(loc=-2, scale=1, size=300)
data_part2 = np.random.normal(loc=3, scale=0.5, size=200)
data = np.concatenate([data_part1, data_part2])

# Kernel Density Plot
sns.kdeplot(data, fill=True, bw_adjust=1.0)  # fill=True replaces the deprecated shade=True
plt.title("Kernel Density Estimation")
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()

[Image placeholder: A KDE plot with bimodal data. Caption: An example of KDE on a bimodal dataset.]

KDE is particularly useful for visualizing data distributions that don't necessarily fit common parametric models. In exploratory data analysis, comparing KDE plots of different groups can reveal differences in distribution shapes, medians, and tails.

Beyond KDE: mixture models

While KDE is a powerful non-parametric method, many real-world distributions are well-modeled by mixtures of simpler parametric families (e.g., Gaussian mixture models). The EM algorithm (Expectation-Maximization) is commonly used to fit mixture models by iteratively refining estimates of the parameters (e.g., means, variances, and mixing proportions in a Gaussian mixture). By comparing multiple approaches (e.g., KDE vs. mixture models), practitioners can decide which method best suits the underlying distribution and the interpretability needs of their application. We'll discuss this cool method somewhere down the line in this course, but for now, let me use ChatGPT to write another summary.
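
Before that summary, a quick teaser: a minimal sketch fitting a two-component Gaussian mixture with scikit-learn (which runs EM under the hood) on bimodal data like the KDE example above:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(10)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)]).reshape(-1, 1)

# Fit a two-component Gaussian mixture via the EM algorithm
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print("Means:  ", gmm.means_.ravel())
print("Stddevs:", np.sqrt(gmm.covariances_.ravel()))
print("Weights:", gmm.weights_)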




Summary

Now we know about:
☝️ Data generation (probability theory): Models how data arise, describing them with distributions (discrete or continuous).
☝️ Data collection (sampling): Acquiring representative samples to make inferences about the population.
☝️ Descriptive statistics: Summarizing data through measures like mean, median, mode, variance, and visual tools like histograms and box plots.
☝️ Inferential statistics: Drawing conclusions about the population through confidence intervals, hypothesis tests, and (soon) regression models.
☝️ The buzzword "machine learning": Applying statistical methods and probability theory to build predictive models, measure uncertainty, and handle complex, high-dimensional data.

We're almost ready to dive into machine learning...
