Anomaly detection

Detecting novelty and outliers

#️⃣   ⌛  ~1 h 🗿  Beginner

04.03.2023

#36


This post is part of the Other ML problems & advanced methods educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, whereas posts in the Research section may appear in arbitrary order.

Anomaly detection, often referred to as outlier detection or novelty detection, is a powerful and widely studied field within data science and machine learning. It encompasses a range of methods whose goal is to identify patterns, observations, or data points that deviate considerably from the rest of the dataset. These deviant observations are what we label as anomalies, outliers, or novelties, depending on the context and the underlying assumptions. In many real-world scenarios — such as fraud detection, system health monitoring, event detection in sensor networks, or performance analysis — anomalies represent significant, unusual, or potentially critical instances that may require investigation.

Historically, anomaly detection has been grounded in the idea of extreme value analysis and robust statistics, often with specialized domain knowledge to manually set thresholds or domain-specific rules. Over time, the prevalence and complexity of data have grown, leading to more sophisticated approaches that leverage both classical statistical inference and machine learning. From purely statistical methods (for instance, fitting a distribution to the data and identifying points in the tails) to more advanced techniques (such as one-class SVMs, isolation forests, autoencoders, and generative models), practitioners have at their disposal a rich arsenal of tools.

One of the earliest seminal works on outliers came from Hawkins (1980), who defined an outlier as "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism." Additional literature, such as Chandola, Banerjee, and Kumar (2009), provided comprehensive surveys on anomaly detection methods across different domains (e.g., intrusion detection, industrial production lines, and financial transactions). Building on these foundations, modern research includes methods tailored to high-dimensional data, streaming or online scenarios, deep representation learning, and more.

In many respects, anomaly detection overlaps with other fundamental tasks in machine learning, such as classification and clustering. For example, in classification, anomalies can show up as rare samples in a predominantly normal class, and in clustering-based methods, small or distant clusters might be flagged as outliers. Simultaneously, anomaly detection has unique challenges: extreme class imbalances, changing data distributions (in streaming setups), or a lack of labeled data for anomalous events, among many others.

The chapters that follow dive into key concepts, classification systems for anomalies, relevant statistical approaches, modern machine learning methods, feature engineering considerations, evaluation strategies, and real-world applications. Though there are many textbooks, research papers, and software frameworks addressing the topic, the intent here is to provide a deep, coherent overview that organizes the major approaches in a single narrative and highlights both classic and cutting-edge techniques.

In what follows, we will:

Define the principal types of anomalies and how they typically arise.
Discuss the statistical foundations, including parametric and non-parametric strategies for anomaly detection.
Present various machine learning-based methods, from supervised to deep learning approaches.
Outline best practices in feature engineering, dimensionality reduction, and domain-specific transformations.
Describe model evaluation metrics and validation techniques that address class imbalance and other pitfalls.
Present real-world applications, usage scenarios, and available open-source frameworks.

We conclude with a high-level summary of key points that help unify the field, while also pointing to ongoing research challenges and emerging methods.

Types of anomalies

Though the concept of an "anomaly" might seem straightforward, it actually contains nuanced categories that help in selecting the right detection approach. Depending on the data characteristics and the context, the following classification is especially common:

Point anomalies

Point anomalies are the simplest and most frequently encountered anomalies. In this scenario, a single point in the dataset is considered deviant relative to the overall data distribution. For example, imagine monitoring the temperature in a server room that normally oscillates between $18$ and $25$ °C. If one sensor reading randomly shoots up to $90$ °C, it would quite clearly be a point anomaly.

This is often the first type of anomaly encountered in practical contexts because it is straightforward to detect using distance- or density-based methods. Some examples include credit-card fraud detection, in which a single very large purchase might be anomalous, or a sensor-based system that flags a single suspicious reading.

Contextual anomalies

Contextual anomalies (sometimes referred to as conditional anomalies) require contextual information to determine whether a data point is truly anomalous. In other words, the "outlierness" of a point depends on the situation or environment in which it appears.

A quintessential example is temperature readings across different seasons or times. A reading of $15$ °C in summer (in a region typically reaching $30$ °C) might be considered anomalous, but the same $15$ °C reading in winter might be perfectly normal. Similarly, a traffic spike on a website might be anomalous on a normal weekday, but not during a major promotional event or product launch.
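To make the temperature example concrete, here is a minimal sketch that scores the same reading against two different contexts. The synthetic readings, the helper name `contextual_z`, and the cutoff of 3 are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical temperature readings (deg C) for two seasons.
rng = np.random.default_rng(0)
summer = rng.normal(loc=30.0, scale=2.0, size=200)
winter = rng.normal(loc=14.0, scale=2.0, size=200)

def contextual_z(value, context_values):
    """z-score of `value` relative to readings from the same context."""
    return (value - context_values.mean()) / context_values.std(ddof=1)

reading = 15.0
z_summer = contextual_z(reading, summer)
z_winter = contextual_z(reading, winter)

# Same value, different verdict depending on the context.
is_anomaly_in_summer = abs(z_summer) > 3
is_anomaly_in_winter = abs(z_winter) > 3
print(f"summer: z={z_summer:.2f}, anomalous={is_anomaly_in_summer}")
print(f"winter: z={z_winter:.2f}, anomalous={is_anomaly_in_winter}")
```

The design point is that the reference statistics are computed per context rather than over the pooled data.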

Collective anomalies

Collective anomalies (or group anomalies) occur when a set or sequence of data points as a whole is deviant, even though each individual element in that set may not be particularly anomalous in isolation. A classic example is a burst of events in a time series — for instance, multiple back-to-back requests that collectively look suspicious, or a contiguous range of sensor readings in a manufacturing process that deviate together from the standard pattern.

In these scenarios, anomalies are best detected by modeling the temporal or structural dependence of consecutive data points rather than analyzing each point individually. Clusters of network requests or unusual patterns in event logs also fall under this category.

Statistical foundations

Motivation and approach

Statistical methods for anomaly detection aim to characterize the typical distribution of data and then identify any observations that appear inconsistent with the assumed distribution. These methods can be either parametric (where a specific distribution like the Gaussian or Poisson distribution is assumed) or non-parametric (where we forego strict assumptions about distributions and rely on data-driven techniques).

Such approaches are historically one of the earliest ways to detect outliers. They remain relevant due to their interpretability and ease of implementation. Moreover, they are often computationally lighter than more advanced ML methods, making them suitable for smaller datasets or well-understood data domains.

Distribution-based methods

In distribution-based anomaly detection, we assume that "normal" data follows a specific distribution (e.g., normal, exponential, gamma). We then fit the parameters of that distribution (mean, variance, etc.) on the data that is presumed to be normal. Any point that significantly deviates, based on a probabilistic threshold, is flagged as an anomaly.

For instance, let $X \in \mathbb{R}^d$ denote a random variable representing the data (with $d$ features). Assume we fit a multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Under that assumption, the log-likelihood for a point $x$ is:

$$\log p(x) = - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log(\det \Sigma) - \frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu).$$

A threshold $\tau$ might be chosen (based on domain knowledge or statistical significance) such that if $\log p(x) < \tau$, we label $x$ as an anomaly.

Variables here:

$x$ is the data point in $d$-dimensional space.
$\mu$ is the estimated mean vector.
$\Sigma$ is the estimated covariance matrix.
$\det \Sigma$ is the determinant of $\Sigma$.
$\Sigma^{-1}$ is the inverse of the covariance matrix.
$\tau$ is a threshold chosen either by cross-validation or a significance level approach.

Alternatively, one might look at the Mahalanobis distance, $D_M$, defined as:

$$D_{M}(x) = \sqrt{(x-\mu)^\top \Sigma^{-1} (x-\mu)},$$

and if $D_M(x)$ is too large compared to what is "typical," $x$ can be treated as anomalous.
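As a rough sketch of the Mahalanobis approach, the snippet below estimates $\mu$ and $\Sigma$ from synthetic data and compares two points against a cutoff. The cutoff is an assumption: if the data are Gaussian, $D_M^2$ is approximately chi-square distributed with $d$ degrees of freedom, so the 99.9% quantile is used here. The data and the two test points are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
# Hypothetical "normal" training data: a correlated 2-D Gaussian.
cov_true = np.array([[1.0, 0.8], [0.8, 1.0]])
X_train = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=500)

# Estimate mu and Sigma from the data presumed to be normal.
mu = X_train.mean(axis=0)
Sigma = np.cov(X_train, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x):
    """D_M(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = x - mu
    return float(np.sqrt(diff @ Sigma_inv @ diff))

# Assumed cutoff: square root of the 99.9% chi-square quantile with d degrees of freedom.
d = X_train.shape[1]
threshold = float(np.sqrt(chi2.ppf(0.999, df=d)))

typical_point = np.array([1.0, 1.0])    # follows the positive correlation
unusual_point = np.array([2.0, -2.0])   # violates it

for name, x in [("typical", typical_point), ("unusual", unusual_point)]:
    dist = mahalanobis(x)
    print(f"{name}: D_M={dist:.2f}, anomalous={dist > threshold}")
```

Note that the second point is not extreme in either coordinate on its own; it is flagged because it contradicts the correlation captured by $\Sigma$.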

Z-score and thresholding approaches

A simpler but still widely used approach is based on the z-score, which applies primarily to univariate data. Here, we compute:

$$z_i = \frac{x_i - \bar{x}}{\sigma},$$

where $x_i$ is the $i$-th observation, $\bar{x}$ is the sample mean, and $\sigma$ is the sample standard deviation. Any observation that satisfies $|z_i| > k$ for some threshold $k$ (commonly 3 or 4) might be flagged as an outlier.

However, in multivariate or high-dimensional contexts, the z-score is less useful, and we typically rely on more elaborate distributional assumptions or robust alternatives (e.g., robust estimation of location and scale).
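The following sketch compares the classical z-score with one common robust alternative based on the median and the median absolute deviation (MAD). The sample values are invented, and the MAD-based score is only one of several robust options.

```python
import numpy as np

# Hypothetical measurements with one obviously extreme value at the end.
x = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 25.0])

# Classical z-score: mean and standard deviation, both sensitive to outliers.
z = (x - x.mean()) / x.std(ddof=1)

# Robust variant: median and MAD. The factor 1.4826 makes the MAD
# comparable to the standard deviation for Gaussian data.
median = np.median(x)
mad = np.median(np.abs(x - median))
robust_z = (x - median) / (1.4826 * mad)

k = 3
classical_flags = np.flatnonzero(np.abs(z) > k)
robust_flags = np.flatnonzero(np.abs(robust_z) > k)
print("classical flags:", classical_flags)
print("robust flags:   ", robust_flags)
```

In this small sample the classical score of the extreme value stays below 3, because the outlier itself inflates the standard deviation (an effect known as masking), while the robust score flags it clearly.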

Parametric vs. non-parametric techniques

In parametric methods, we require a predefined distribution or family of distributions. For instance, we might assume that normal operating temperatures in an industrial sensor system follow a Gaussian distribution or a mixture of Gaussians. If the data is large and truly adheres to these assumptions, parametric methods can be extremely efficient and interpretable.

Non-parametric methods, on the other hand, do not assume a specific distribution. Instead, they use data-driven strategies such as kernel density estimation (KDE) or rank-based tests. KDE, for instance, estimates the underlying probability density function by summing individual kernel functions centered on each data point:

$$\hat{p}(x) = \frac{1}{n\,h^d} \sum_{i=1}^n K \Bigl( \frac{x - x_i}{h} \Bigr),$$

where $n$ is the number of samples, $h$ is the bandwidth (a smoothing parameter), and $K$ is a chosen kernel function (commonly Gaussian). Points $x$ for which $\hat{p}(x)$ is exceedingly small can be flagged as anomalies.

Variables here:

$n$ is the total number of reference points in the dataset.
$d$ is the dimensionality of $x$.
$h$ is the kernel bandwidth controlling how wide each kernel is spread.
$K$ is typically a symmetric, non-negative function that integrates to 1, e.g. a Gaussian kernel.
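A minimal sketch of KDE-based scoring with scikit-learn's `KernelDensity` might look as follows. The bandwidth of 0.5 and the 1% quantile cutoff are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_ref = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # reference ("normal") points

# Gaussian kernel with an assumed bandwidth h = 0.5.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_ref)

# score_samples returns the log-density estimate log p_hat(x).
log_density_ref = kde.score_samples(X_ref)
threshold = np.quantile(log_density_ref, 0.01)   # lowest 1% of reference densities

X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
log_density_new = kde.score_samples(X_new)
is_anomaly = log_density_new < threshold
print(log_density_new, is_anomaly)
```

In practice the bandwidth is usually chosen by cross-validation, since a value that is too small makes every unseen point look anomalous and a value that is too large smooths real structure away.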

Time-series considerations

Anomaly detection in time-series settings introduces complexities such as trends, seasonality, and serial correlations. A purely distribution-based method that ignores time or ordering might misclassify cyclical or repeated patterns as anomalies. Thus, domain-aware approaches (e.g., ARIMA-based residual analysis, decomposition-based outlier detection) have been developed.

In classical time-series anomaly detection, a model such as ARIMA is fit to the series, producing forecasts and confidence intervals. Observations lying far outside the forecast envelope are flagged. Alternatively, robust decomposition can separate the series into trend, seasonality, and remainder components, with anomalies identified in the remainder if they exceed certain thresholds.
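The decomposition idea can be sketched with plain NumPy. This is a simplified classical decomposition (moving-average trend plus per-phase seasonal means), not a robust method such as STL; the synthetic series, the period of 24, and the cutoff of 6 robust standard deviations are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
period = 24                                   # hypothetical hourly data, daily cycle
t = np.arange(period * 30)
series = 10 + 0.01 * t + 3 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.3, t.size)
series[400] += 5.0                            # injected anomaly

# Trend: moving average over one full period (undefined at the edges).
half = period // 2
trend = np.full(series.size, np.nan)
trend[half:-half + 1] = np.convolve(series, np.ones(period) / period, mode="valid")

# Seasonality: mean detrended value at each position within the cycle.
detrended = series - trend
profile = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
seasonal = profile[t % period]

# Remainder: what neither trend nor seasonality explains.
remainder = detrended - seasonal

# Robust score on the remainder (median and MAD); edge NaNs become 0.
med = np.nanmedian(remainder)
mad = np.nanmedian(np.abs(remainder - med))
score = np.nan_to_num(np.abs(remainder - med) / (1.4826 * mad), nan=0.0)
flagged = np.flatnonzero(score > 6)
print("flagged indices:", flagged)
```

Scoring the raw series instead of the remainder would mostly pick up the peaks of the daily cycle, which is exactly the failure mode described above.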

Machine learning methods

Machine learning-based anomaly detection methods often revolve around learning a representation of "normal" data and then scoring or classifying new observations as normal or abnormal. We can broadly categorize these approaches into supervised, unsupervised, semi-supervised, and deep learning-based techniques.

4.1. Supervised anomaly detection

When labeled data is available and we have examples of both normal instances and anomalies, one can treat anomaly detection as a supervised classification problem. However, real-life anomaly detection often suffers from severe class imbalance: anomalies can be extremely rare, and in many domains, large sets of labeled anomalies do not exist. Nonetheless, when some labeled anomalies are indeed available, standard classification approaches (e.g., random forests, support vector machines, gradient boosting) can be adapted with specialized strategies for imbalanced classification:

Oversampling the minority class with methods like SMOTE (Synthetic Minority Over-sampling TEchnique).
Undersampling the majority class.
Incorporating class weight adjustments in the classifier's training loss.
Using specialized metrics such as the F2-score, ROC AUC, or PR AUC that focus more on minority-class performance.
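As one possible illustration of the class-weighting strategy, the sketch below trains a random forest with `class_weight="balanced"` on synthetic labeled data and reports PR AUC. The data, the roughly 2% anomaly rate, and the hyperparameters are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical labeled data: 2000 normal points and 40 anomalies (about 2%).
X_normal = rng.normal(0.0, 1.0, size=(2000, 4))
X_anom = rng.normal(3.0, 1.0, size=(40, 4))
X = np.vstack([X_normal, X_anom])
y = np.r_[np.zeros(2000, dtype=int), np.ones(40, dtype=int)]

# Stratify so that both splits keep the same anomaly ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" weights samples inversely to class frequency.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
pr_auc = average_precision_score(y_te, proba)
recall = recall_score(y_te, clf.predict(X_te))
print(f"PR AUC={pr_auc:.3f}  anomaly recall={recall:.3f}")
```

Oversampling with SMOTE would be a drop-in alternative to class weights, but it requires the separate imbalanced-learn package and is not shown here.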

An example challenge with supervised learning arises when the anomalies are highly heterogeneous. In such a scenario, a single "anomaly class" might not represent the full variety of abnormal patterns. Another challenge is that new, previously unseen anomalies might go undetected.

Performance challenges with imbalanced data

Classification-based approaches for anomaly detection typically face datasets with many more normal examples than anomalies. If a naive model simply classifies all points as normal, it may achieve a deceptively high overall accuracy. Therefore, the focus shifts to metrics like:

Precision, recall, and F1 (or F2) specifically for the anomaly class.
The area under the precision-recall curve (PR AUC), which is particularly informative under class imbalance.
The area under the ROC curve (ROC AUC), though it may sometimes overestimate performance in extremely imbalanced settings.
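The small sketch below shows why accuracy misleads here. It compares a naive "everything is normal" predictor with a hypothetical detector on made-up labels (990 normal points, 10 anomalies).

```python
import numpy as np
from sklearn.metrics import accuracy_score, fbeta_score, precision_score, recall_score

# Hypothetical ground truth: 990 normal points (0) and 10 anomalies (1).
y_true = np.r_[np.zeros(990, dtype=int), np.ones(10, dtype=int)]

# A naive "model" that labels everything as normal.
y_naive = np.zeros_like(y_true)

# A hypothetical detector: finds 8 of the 10 anomalies, raises 12 false alarms.
y_detector = np.zeros_like(y_true)
y_detector[990:998] = 1      # 8 true positives
y_detector[:12] = 1          # 12 false positives

results = {}
for name, y_pred in [("naive", y_naive), ("detector", y_detector)]:
    results[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f2": fbeta_score(y_true, y_pred, beta=2, zero_division=0),
    }
    print(name, {k: round(float(v), 3) for k, v in results[name].items()})
```

The naive predictor ends up with the higher accuracy even though it never finds a single anomaly, which is why the anomaly-class metrics are the ones to watch.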

4.2. Unsupervised anomaly detection

Unsupervised methods constitute a major portion of anomaly detection algorithms, precisely because they do not require labels. They rely on the assumption that anomalies are rare and different from the bulk of the data.

Clustering-based methods

Since anomalies are typically distinct or sparse, small or distant clusters in a clustering solution can be flagged as anomalies. For instance, with K-means, if a point's distance to the assigned cluster centroid is too large compared to an empirical threshold, it might be labeled as an anomaly. Alternatively, hierarchical clustering can reveal small clusters with few points, and density-based clustering (e.g., DBSCAN) explicitly labels points that do not belong to any dense region as noise; both can be treated as anomaly candidates.

Nonetheless, one must be mindful that if the dataset has multiple natural clusters of normal points, naive clustering-based approaches might misclassify valid clusters with fewer points as anomalies. Hybrid strategies or domain-driven thresholds can mitigate this.
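A minimal sketch of the K-means variant is shown below, assuming two clusters of normal points and a 99th-percentile distance cutoff. Both choices, and the three injected points, are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hypothetical clusters of normal points plus three distant points.
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(150, 2)),
    rng.normal([5, 5], 0.5, size=(150, 2)),
    np.array([[2.5, 2.5], [-4.0, 4.0], [9.0, 0.0]]),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of every point to the centroid of its assigned cluster.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Empirical threshold: the 99th percentile of all distances.
threshold = np.quantile(dist, 0.99)
outlier_idx = np.flatnonzero(dist > threshold)
print("flagged indices:", outlier_idx)
```

A percentile cutoff always flags a fixed share of points, so it can also flag ordinary points at the edge of a cluster; a domain-driven distance threshold avoids that.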

Density estimation

Density-based approaches, such as Local Outlier Factor (LOF), measure how isolated a point is with respect to its neighbors in the feature space. In LOF, each data point's local density is compared to the densities of its nearest neighbors. Points with significantly lower local densities than their neighbors receive higher outlier scores.

Other well-known methods include robust kernel density estimation or the aforementioned parametric and non-parametric distribution modeling. However, in high dimensions, distance measures and density estimation can become less reliable due to the "curse of dimensionality."
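The local nature of LOF can be sketched with scikit-learn's `LocalOutlierFactor`. The synthetic data (one dense cluster, one sparse cluster, and a probe point just outside the dense one) and `n_neighbors=20` are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# A dense cluster, a sparse cluster, and one point just outside the dense cluster.
dense = rng.normal([0, 0], 0.2, size=(200, 2))
sparse = rng.normal([6, 6], 1.5, size=(200, 2))
probe = np.array([[1.5, 1.5]])
X = np.vstack([dense, sparse, probe])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_       # LOF values; about 1 means "as dense as its neighbors"

print("LOF of probe point:", scores[-1])
print("probe flagged:", labels[-1] == -1)
```

The probe sits at a distance from its neighbors that would be unremarkable inside the sparse cluster; it stands out only because its nearest neighbors belong to the much denser one.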

4.3. Semi-supervised techniques

Semi-supervised techniques in anomaly detection typically assume that we have access to a set of "normal" training samples but lack reliable labels for anomalies. The model attempts to learn a compact representation of normal behavior and flags deviations from it as anomalies.

One-class SVM

One-class SVM is a powerful semi-supervised approach introduced in the late 1990s and widely popularized in the 2000s. It tries to separate the origin (in feature space) from the data with a maximum margin boundary, effectively isolating most of the data points in a particular region. New points lying outside this region are labeled as anomalies or novelties.

In scikit-learn's implementation (sklearn.svm.OneClassSVM), the RBF (radial basis function) kernel is commonly used. The parameter $\nu$ is an upper bound on the fraction of training errors (i.e., training points treated as outliers) and a lower bound on the fraction of support vectors, so it directly influences how tightly the boundary wraps the data. This method is especially well-suited for novelty detection: it learns from a training set that is assumed to be largely free of anomalies and then aims to detect anomalies in future data.
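A short sketch of this novelty-detection workflow is given below. The synthetic training set, the value `nu=0.05`, and the scaling step are assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 2))     # assumed to be anomaly-free
X_new = np.array([[0.2, -0.1], [5.0, 5.0]])

# nu=0.05 allows roughly up to 5% of training points outside the boundary.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(X_train)

train_outlier_fraction = np.mean(model.predict(X_train) == -1)
new_predictions = model.predict(X_new)            # 1 = inlier, -1 = novelty
print("fraction of training points outside:", train_outlier_fraction)
print("predictions for new points:", new_predictions)
```

Scaling is included because the RBF kernel is distance-based, so features on very different scales would otherwise dominate the boundary.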

Isolation forest

Isolation forest is another widely adopted algorithm. It is usually run fully unsupervised on unlabeled data, but it also fits the setting described here when it is trained on mostly normal samples. Rather than modeling the distribution of normal data, it explicitly attempts to isolate anomalies by selecting features and thresholds at random to recursively partition the feature space. Anomalies tend to be easier to isolate because they typically require fewer splits:

Build an ensemble of random trees where each tree is grown by randomly selecting features and splitting values.
For a given point $x$, measure the average path length across all trees.
Points with shorter average path lengths are considered anomalies (easier to isolate).

This idea resonates with the broader concept of ensemble learning: if many random partitions isolate a point early, it is likely abnormal. Isolation forests are relatively fast and can handle high-dimensional datasets.

Other semi-supervised approaches

RANSAC (RANdom SAmple Consensus) can be used to robustly fit a model to the majority of the data. Points that deviate significantly from the fitted model are flagged as outliers or anomalies.
Elliptic envelope (in scikit-learn) assumes the inliers are Gaussian, fits a robust estimate of their location and covariance, and flags points whose Mahalanobis distance places them outside the resulting ellipse. This works best for roughly unimodal, elliptical distributions.

4.4. Deep learning-based methods

In recent years, the anomaly detection landscape has been significantly influenced by deep learning research, offering advanced ways to learn representations in complex, high-dimensional data such as images, time series, and sensor networks.

Autoencoders for reconstruction error

A popular approach is to train an autoencoder on "normal" data. The autoencoder, a neural network architecture with a bottleneck hidden layer, learns to reconstruct normal data. The reconstruction error is then used as an anomaly score:

Train the autoencoder using only normal examples (or mostly normal).
For new data, measure how well the autoencoder reconstructs the input.
If the reconstruction error $\lVert x - \hat{x} \rVert$ is above a certain threshold, the point is labeled as an anomaly.

Such approaches have been utilized extensively in fraud detection, defect detection in images, and industrial IoT sensor monitoring. Variants such as denoising autoencoders, variational autoencoders, and convolutional autoencoders are also widely applied.
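The three steps above can be sketched without a deep learning framework by using scikit-learn's `MLPRegressor` as a stand-in autoencoder, trained to reproduce its own input through a two-unit bottleneck. This is a deliberate simplification: the synthetic data, the layer sizes, and the 99th-percentile threshold are assumptions, and a real project would more likely use PyTorch or TensorFlow.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical normal data: 10 features driven by 2 latent factors plus small noise.
W = rng.normal(size=(2, 10))
X_normal = rng.normal(size=(1000, 2)) @ W + 0.05 * rng.normal(size=(1000, 10))
X_anom = rng.normal(scale=3.0, size=(20, 10))     # ignores the latent structure

scaler = StandardScaler().fit(X_normal)
Xn, Xa = scaler.transform(X_normal), scaler.transform(X_anom)

# Step 1: train on normal data only; input and target are the same array.
autoencoder = MLPRegressor(hidden_layer_sizes=(16, 2, 16), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(Xn, Xn)

def reconstruction_error(X):
    """Euclidean norm ||x - x_hat|| for each row."""
    return np.linalg.norm(X - autoencoder.predict(X), axis=1)

# Steps 2 and 3: score new data and compare against a threshold
# taken from the errors on normal data (an illustrative choice).
err_normal = reconstruction_error(Xn)
err_anom = reconstruction_error(Xa)
threshold = np.quantile(err_normal, 0.99)
flags = err_anom > threshold
print("fraction of anomalies flagged:", flags.mean())
```

The bottleneck width matters: if it is wide enough to copy arbitrary inputs, anomalies are reconstructed well too and the error stops being informative.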

Generative adversarial networks (GANs)

GAN-based anomaly detection methods (e.g., AnoGAN) train a generator-discriminator pair on normal data. The assumption is that the generator will fail to produce convincing replicas for anomalous inputs, while the discriminator's output or an auxiliary error metric can be used to detect anomalies. Although more complex to train, GANs have shown promising results in vision and other data modalities.

A high-level outline:

Train a GAN using only normal data.
For a new data point, find the latent representation $z$ in the generator's space that best reconstructs the data point.
If the reconstruction is poor or the discriminator's confidence is low, classify the point as an anomaly.

Deep learning-based solutions can require substantial data and computational resources, yet they offer unmatched flexibility when dealing with unstructured or high-dimensional data like images, audio signals, or complex time series. Recent research has also explored graph neural networks for anomaly detection in graph-structured data, LSTM-based detection for sequential data, and transformer-based anomaly detection for complicated temporal patterns.

Feature engineering and dimensionality reduction

High-dimensional data complicates anomaly detection for many algorithms, particularly distance- and density-based methods, due to the so-called curse of dimensionality. Consequently, thoughtful feature engineering can be crucial to success.

5.1. Selecting and scaling features

In many anomaly detection scenarios, the appropriate choice and preprocessing of features is as essential as the choice of the detection algorithm:

Normalization or standardization: E.g., subtract mean, divide by standard deviation for each feature, especially important for distance-based methods.
Robust scaling: Re-scaling with robust measures such as median and interquartile range can mitigate the influence of outliers on feature scaling.
Domain-driven features: Domain knowledge often reveals specialized transformations. For instance, log-scaling or power transformations for data with heavy tails.
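To illustrate the difference between the first two options, the sketch below fits scikit-learn's `StandardScaler` and `RobustScaler` to a feature contaminated by a few extreme values. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(50.0, 5.0, size=(200, 1))
x[:5] = 5000.0                          # a few extreme values contaminate the feature

standard = StandardScaler().fit(x)      # mean and standard deviation
robust = RobustScaler().fit(x)          # median and interquartile range

print("StandardScaler: center=%.1f scale=%.1f" % (standard.mean_[0], standard.scale_[0]))
print("RobustScaler:   center=%.1f scale=%.1f" % (robust.center_[0], robust.scale_[0]))

# A typical value maps close to 0 under robust scaling despite the outliers.
typical = np.array([[50.0]])
standard_typical = standard.transform(typical)[0, 0]
robust_typical = robust.transform(typical)[0, 0]
print(standard_typical, robust_typical)
```

With standard scaling the five extreme values drag the center and inflate the scale, which compresses all normal values into a narrow band; robust scaling keeps the normal range spread out and the extremes far away.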

5.2. Dimensionality reduction (PCA, t-SNE, UMAP)

Reducing dimensionality can help highlight potential clusters of normal data or separate outliers in a latent projection.

PCA: A linear transformation that captures the directions of maximal variance. Points lying far from the principal subspace can be labeled as anomalies. The $Q$-statistic or the reconstruction error in PCA is sometimes used as an outlier score.
t-SNE: A non-linear dimensionality reduction method often used for visualization. Although t-SNE is primarily a visualization tool, it can help practitioners spot outliers in a 2D embedding.
UMAP: Another non-linear approach that focuses on local neighborhood structure while typically retaining more of the global layout than t-SNE. Like t-SNE, it is widely used for exploration and outlier discovery, though distances in either embedding should be interpreted with caution.
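The PCA-based score mentioned above can be sketched as follows. The data are synthetic (points near a 2-D plane in 5-D space), and the displaced test point and the 99th-percentile threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data lying close to a 2-D plane inside a 5-D space.
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(300, 2)) @ basis + 0.05 * rng.normal(size=(300, 5))

# A unit vector orthogonal to the plane (last right-singular vector of `basis`).
null_dir = np.linalg.svd(basis)[2][-1]
# Test set: one ordinary point and the same point pushed off the plane.
X_test = np.vstack([X[:1], X[:1] + 2.0 * null_dir])

pca = PCA(n_components=2).fit(X)

def spe(X):
    """Squared reconstruction error (Q-statistic) for each row."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)

threshold = np.quantile(spe(X), 0.99)
test_scores = spe(X_test)
print("scores:", test_scores, "threshold:", threshold)
```

The displaced point has perfectly ordinary coordinates within the principal subspace; only the residual outside the subspace reveals it.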

5.3. Domain-specific feature transformations

Financial transactions might require ratio features (e.g., comparing an expense to a user's average monthly expense). Network traffic anomalies might require features derived from packet sizes, protocol usage, or temporal frequency. Similarly, sensor data might benefit from aggregated features such as rolling averages, differences, or wavelet transform coefficients.

In time-series anomaly detection, transformations that isolate cyclical or seasonal components can be helpful, as can features capturing changes from one time step to the next. The right transformation can reduce noise and highlight patterns that separate normal from abnormal.
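A few of these transformations can be sketched with pandas. The transactions and sensor values are invented, and the per-user median is used as a robust stand-in for a user's "average" expense.

```python
import pandas as pd

# Hypothetical transactions for two users.
tx = pd.DataFrame({
    "user":   ["a", "a", "a", "a", "b", "b", "b", "b"],
    "amount": [20.0, 25.0, 22.0, 400.0, 900.0, 950.0, 1000.0, 980.0],
})
# Ratio of each expense to that user's typical (median) expense.
user_median = tx.groupby("user")["amount"].transform("median")
tx["ratio_to_user_median"] = tx["amount"] / user_median

# Hypothetical sensor signal with a spike at position 5.
sensor = pd.Series([1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 1.1, 1.0])
features = pd.DataFrame({
    "value": sensor,
    "rolling_mean_3": sensor.rolling(window=3, min_periods=1).mean(),
    "diff_1": sensor.diff(),          # change from the previous time step
})

print(tx)
print(features)
```

The 400.0 purchase is small in absolute terms compared with user b's spending, but the ratio feature makes it stand out relative to user a's own history.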

Model evaluation and validation

Evaluating anomaly detection models is challenging because of the extreme class imbalance, potential absence of labeled anomalies, and the possibility of new types of anomalies appearing in the future. Nevertheless, certain strategies and metrics apply generally:

Precision, recall, and F1-score for anomalies, if we have ground truth labels.
ROC AUC or PR AUC, which summarize performance across all decision thresholds and are therefore useful when no single threshold has been fixed yet. PR AUC is more insightful when positives (anomalies) are rare.
Cross-validation: K-fold cross-validation can be adapted to measure outlier detection performance, although ensuring a consistent ratio of anomalies in each fold can be tricky if anomalies are extremely rare.
Time-series splits: In time-series anomaly detection, standard random folds can lead to information leakage. Instead, one uses rolling or expanding window evaluation.

Moreover, when ground truth is scarce, domain expert feedback is invaluable to confirm or reject flagged anomalies. In certain industries, the cost of a missed anomaly might far exceed the cost of a false alarm, so the trade-off between false positives and false negatives must be carefully balanced.
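The two splitting strategies from the list above can be sketched with scikit-learn's `StratifiedKFold` and `TimeSeriesSplit`. The labels are synthetic (10 anomalies among 200 samples) and the fold counts are arbitrary.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Hypothetical labels: 200 samples, 10 of them anomalies.
y = np.zeros(200, dtype=int)
y[::20] = 1
X = np.arange(200).reshape(-1, 1)

# Stratified folds keep the anomaly ratio the same in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
anomalies_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print("anomalies per stratified test fold:", anomalies_per_fold)

# Expanding-window splits: every training index precedes every test index.
tscv = TimeSeriesSplit(n_splits=4)
time_folds = [(int(tr.max()), int(te.min()), int(te.max())) for tr, te in tscv.split(X)]
for fold, (train_end, test_start, test_end) in enumerate(time_folds):
    print(f"time fold {fold}: train up to {train_end}, test {test_start}..{test_end}")
```

Stratification is only possible when each fold can receive at least one anomaly; with fewer labeled anomalies than folds, repeated hold-out evaluation or expert review may be the more realistic option.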

Handling class imbalance

Class imbalance is a persistent issue. In a typical industrial scenario, anomalies may comprise well below 1% of the data. Techniques include:

Synthetic anomaly generation: If domain knowledge allows, artificially creating plausible anomalies can help in training or validating.
Cost-sensitive methods: Adjust the cost function to penalize missed anomalies more heavily.
Oversampling the known anomalies or undersampling the normal data, though care must be taken not to lose valuable information about normal patterns.

When unlabeled data is abundant, semi-supervised methods that learn normal patterns remain highly attractive.

Cases, applications, Python tools, examples

Anomaly detection is used across a variety of fields:

Credit card fraud: Real-time monitoring of transactions to detect potential fraud.
Intrusion detection: Identifying unauthorized access or suspicious behavior on networks.
Equipment health: Monitoring sensor data for unexpected vibrations, temperatures, or other signals that deviate from known healthy patterns.
Healthcare: Flagging unusual patient vital signs, new disease symptoms, or rare medical conditions.
Manufacturing: Spotting production irregularities, defective products, or malfunctioning machines.

Below is a quick Python code snippet illustrating how one might use isolation forests in scikit-learn:

<Code text={`
import numpy as np
from sklearn.ensemble import IsolationForest

# Example: generate a small synthetic dataset
rng = np.random.RandomState(42)
# Normal data: two tight clusters (standard deviation 0.3) centered at (2, 2) and (-2, -2)
X_normal = 0.3 * rng.randn(100, 2)
X_normal = np.r_[X_normal + 2, X_normal - 2]

# Add 20 outliers drawn uniformly from the square [-6, 6] x [-6, 6]
X_outliers = rng.uniform(low=-6, high=6, size=(20, 2))

X = np.concatenate([X_normal, X_outliers], axis=0)

# Fit isolation forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
iso_forest.fit(X)

# Predict anomalies (-1 means outlier, 1 means inlier)
preds = iso_forest.predict(X)
scores = iso_forest.decision_function(X)

print("Predictions:", preds)
print("Anomaly scores:", scores)
`}/>

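Here, we generate normal data in two clusters (centered at $(2, 2)$ and $(-2, -2)$, 100 points each) and a set of 20 uniformly distributed outliers. We train an Isolation Forest with a contamination ratio of 0.1, which tells the model to flag about 10% of the training points (the true share here is 20 of 220, roughly 9%). Points receiving a prediction of -1 are outliers, while 1 indicates normal; lower values of decision_function indicate more anomalous points.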

Many other popular tools and libraries exist:

scikit-learn includes OneClassSVM, IsolationForest, LocalOutlierFactor, and EllipticEnvelope.
PyOD (Python Outlier Detection) is a comprehensive library with numerous algorithms, including advanced ensembles and deep learning-based methods.
TensorFlow and PyTorch can be used to build custom neural architectures for anomaly detection.

Tools, frameworks, and libraries

Besides the scikit-learn ecosystem, the following are noteworthy:

PyOD: One of the most extensive open-source Python libraries solely dedicated to anomaly detection, featuring well-known algorithms (Isolation Forest, LOF, One-Class SVM, etc.) and advanced ones (like COPOD, ECOD, SUOD).
River: A Python library for online (streaming) machine learning that includes anomaly detection methods, useful when data arrives continuously and the model must update incrementally.
MLlib in Apache Spark: Provides distributed building blocks, such as k-means and Gaussian mixture clustering, on which outlier detection can be built for big data.
H2O.ai: Offers a platform with anomaly detection functionality, such as an isolation forest implementation, alongside automated machine learning features.

For specialized domains — like anomaly detection in graphs or images — libraries such as PyTorch Geometric (for graph neural networks) or integrated frameworks for computer vision can be harnessed to build custom solutions.

Conclusion

Anomaly detection is an indispensable technique in data science and machine learning, powering solutions ranging from fraud prevention and cybersecurity to industrial monitoring and healthcare diagnostics. By defining an anomaly as something that does not conform to an established "normal" pattern, practitioners must grapple with the inherent imbalance and diversity of anomalies.

From a historical perspective, statistical methods (z-score, parametric distribution fitting, robust statistics) laid the groundwork and remain crucial whenever domain assumptions apply or data is limited. Over time, unsupervised and semi-supervised machine learning approaches have brought more flexible, data-driven solutions, especially in settings where labeled anomalies are scarce or incomplete. Algorithms such as one-class SVM, isolation forests, and local outlier factor have become mainstays, given their ease of use and solid performance across different domains.

Simultaneously, deep learning has opened the door for new paradigms: learning a compressed representation of normal data (via autoencoders or generative models) can greatly amplify our ability to highlight anomalies in complex, high-dimensional data. This area continues to be one of active research, with developments in transformer-based or diffusion-based anomaly detection for time series, images, and text data.

Developing a successful anomaly detection solution requires careful planning, including:

Understanding the data domain, the types of anomalies of interest, and the implications of false positives versus false negatives.
Engineering appropriate features and scaling methods that mitigate the curse of dimensionality and highlight distinguishing characteristics.
Employing robust validation and evaluation strategies that take imbalance and domain feedback into account.

Although numerous open-source frameworks offer a wealth of ready-to-use anomaly detection algorithms, domain expertise often remains the linchpin of success. Cleverly chosen features and well-considered thresholds can be more critical than the choice between, say, isolation forest and local outlier factor.

Looking ahead, research trends include self-supervised learning for anomaly detection (leveraging large unlabeled datasets to learn robust representations), few-shot and active learning approaches for leveraging small sets of labeled anomalies, and scalable distributed detection methods for massive streaming data. Together, these developments underscore the vibrant and evolving nature of anomaly detection as a discipline — one that continues to play a vital role in safeguarding systems, processes, and data-driven decision-making across the globe.

Averett's Heuristics@avheuristics
