

🎓 31/167
This post is part of the Classification basics & ensembling educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; the order in Research can be arbitrary.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!
The k-nearest neighbors (KNN) algorithm is a foundational technique in both classical and modern machine learning, valued for its conceptual simplicity, intuitive appeal, and effectiveness in certain application domains. In essence, this model belongs to the family of instance-based or lazy-learning methods. The key idea is to classify or predict the label of a new data point by looking at its "neighbors" in the feature space — that is, the most similar or closest points from the training set — and using their labels to make an inference.
This chapter dives into the mathematics, intuition, and theoretical properties that underlie the KNN approach. We will also discuss the bias-variance tradeoff for KNN, potential pitfalls, convergence guarantees, and references to both traditional and cutting-edge research.
Overview of k-nearest neighbors
At a high level, KNN presupposes that objects that are near each other in feature space are likely to share the same or similar labels. This often goes hand-in-hand with the so-called "smoothness" or "compactness" assumption: if two samples are sufficiently close under a suitable distance metric, they should belong to the same class (in classification) or have similar numeric values (in regression).
Concretely, the classification version of KNN works as follows:
- You are given a training set of labeled samples, $\{(x_i, y_i)\}_{i=1}^{m}$. Each $x_i$ is typically an $n$-dimensional feature vector, and each $y_i$ is the label or target.
- You have a distance or similarity measure $d(x, x')$. For example, Euclidean distance is commonly used, but other metrics can be substituted.
- To predict the label of a new, unseen sample $x$, you compute the distance from $x$ to all training samples $x_i$.
- You select the $k$ nearest points (i.e., the $k$ samples in the training set with the smallest distances to $x$).
- You aggregate the labels (for classification) or numeric targets (for regression) of those $k$ points in order to make your prediction.
Mathematically, for classification, a general KNN-based predictor for a new point $x$ can be written as:
$$\hat{y}(x) = \arg\max_{c} \sum_{i=1}^{k} w_i \, \mathbb{1}\!\left[y_{(i)} = c\right],$$
where $y_{(i)}$ denotes the label of the $i$-th nearest neighbor of $x$, and $w_i$ is a weight function that determines how much the $i$-th neighbor contributes to the classification of $x$. The simplest scenarios are:
- Unweighted / majority-vote KNN: $w_i = 1$, meaning that we count the top $k$ neighbors equally and ignore all other training points.
- Weighted KNN: the neighbors are weighted inversely by their distance to $x$, or perhaps by a kernel function, ensuring that closer neighbors exert a stronger influence.
For regression, the logic is much the same, except we compute (for instance) the mean of the $k$ nearest neighbors' numeric values, potentially weighting them as a function of distance.
Basic principles
KNN is considered a lazy learner because it defers computation until inference time (classification or regression). There is no explicit training process in the same sense as parameter-based models (e.g., linear or logistic regression, neural networks, etc.). Instead, the entire training set is often stored, and queries require scanning or indexing that training set. This "train-later, predict-now" characteristic has implications for both memory consumption and computational cost at inference.
The underpinnings of KNN's consistency results (the fact that it can perform well asymptotically) trace back to the seminal work of T. Cover and P. Hart from 1967, who proved that the 1-nearest neighbor classifier is guaranteed to achieve at most twice the Bayes error rate in the limit of infinite samples (Cover and Hart, 1967, IEEE Transactions on Information Theory).
Mathematical formulation
A rigorous way to describe KNN is to define, for each new input $x$, the set $N_k(x)$ of its $k$ nearest neighbors. Concretely, $N_k(x)$ is chosen so that $|N_k(x)| = k$ and every point not in $N_k(x)$ is at least as far from $x$ as the farthest point in $N_k(x)$. The KNN classifier can then be written as:
$$\hat{y}(x) = \arg\max_{c} \sum_{x_i \in N_k(x)} \mathbb{1}\!\left[y_i = c\right].$$
That is, you select the label $c$ that appears most frequently among the $k$ neighbors.
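To make this concrete, here is a minimal from-scratch sketch of the majority-vote rule in NumPy, assuming Euclidean distance and a single query point (the function name and toy data are mine, purely for illustration):
```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Majority-vote KNN for a single query point, using Euclidean distance."""
    # Distances from the query point to every training sample.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples.
    nearest = np.argsort(dists)[:k]
    # The label that appears most frequently among those k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two well-separated classes in 2D.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```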
Bias-variance tradeoff in KNN
In KNN, the bias and variance are primarily controlled by $k$ and by the choice of distance metric. Roughly speaking:
- For very small $k$ (e.g., $k = 1$), the model has low bias (it can fit extremely local structures) but can have high variance, as it is sensitive to noise or outlier points.
- For larger $k$, the model has higher bias (it effectively takes broader neighborhoods, smoothing out local fluctuations) and lower variance, resulting in more stable predictions.
Because KNN has no parametric form, it can adapt to very complex decision boundaries. However, in high-dimensional spaces or in the presence of very large training data sets, naive KNN can become computationally expensive.
Distance metrics
One of the most critical components in KNN is the distance metric or similarity measure. The standard approach might be Euclidean distance, but the real power (and risk) of KNN lies in how well the chosen distance metric reflects semantic similarity among data points.
Euclidean distance
Often the default choice, the Euclidean distance between two vectors $x$ and $x'$ in an $n$-dimensional space is:
$$d(x, x') = \sqrt{\sum_{j=1}^{n} (x_j - x'_j)^2}.$$
Euclidean distance has an intuitive geometric interpretation: the length of the straight line between the points in Euclidean space. However, it can be very sensitive to magnitude differences among features. To mitigate this, you often normalize or standardize each feature dimension so that they are on a comparable scale.
Manhattan distance
Also called the L1 metric or "taxicab" metric, the Manhattan distance is:
$$d(x, x') = \sum_{j=1}^{n} |x_j - x'_j|.$$
When you have reason to assume that coordinate-wise differences are more robust or meaningful for your task, or if you want the distance measure to be less sensitive to outliers in a single dimension, Manhattan distance can be beneficial.
Minkowski distance
The Minkowski class of metrics generalizes multiple norms in a single formula:
$$d_p(x, x') = \left( \sum_{j=1}^{n} |x_j - x'_j|^p \right)^{1/p}$$
- $p = 1$: recovers the Manhattan distance
- $p = 2$: recovers the Euclidean distance
- $p \to \infty$: recovers the Chebyshev distance (the maximum absolute difference along any dimension)
This flexibility lets you experiment with different norms, each of which may capture subtle differences in how distance is measured. However, one must tune $p$ carefully, because the data geometry can vary drastically under different norms.
Cosine similarity
Cosine similarity (or equivalently, cosine distance if you transform it into a distance measure) focuses on the angle between two vectors rather than their magnitude. It is commonly used in high-dimensional and sparse contexts such as text mining or natural language processing. If $x$ and $x'$ are real vectors in $\mathbb{R}^n$, the cosine similarity is:
$$\cos(x, x') = \frac{x \cdot x'}{\|x\| \, \|x'\|}.$$
One can convert this similarity into a distance via $d(x, x') = 1 - \cos(x, x')$. This is especially relevant in fields like information retrieval, textual clustering, or word embeddings, where directional closeness in the vector space is more important than raw Euclidean magnitude.
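To see how these metrics differ numerically on the same pair of vectors, here is a small sketch using SciPy's distance module (the vectors are arbitrary toy values):
```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 4.0])

print("Euclidean:", distance.euclidean(x, z))             # sqrt(1 + 4 + 1)
print("Manhattan:", distance.cityblock(x, z))             # 1 + 2 + 1
print("Minkowski (p=3):", distance.minkowski(x, z, p=3))
print("Chebyshev:", distance.chebyshev(x, z))             # max(1, 2, 1)
print("Cosine distance:", distance.cosine(x, z))          # 1 - cosine similarity
```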
Impact of choosing different metrics
Your choice of metric can profoundly affect performance. For example, with high-dimensional data, Euclidean distance might become less meaningful (the "curse of dimensionality"), so sometimes people turn to alternatives (e.g., Mahalanobis distance with an appropriate covariance matrix, or even learned distance metrics like LMNN or FaceNet embeddings). Real-world success with KNN depends heavily on whether the distance measure aligns with how classes or underlying patterns separate in the feature space.
Classification with KNN
Decision boundaries
A classic visual of KNN is the non-linear decision boundary that forms when you partition the feature space based on nearest neighbors. Because no explicit parametric shape is assumed, the boundary can be quite complicated.
If you imagine a 2D or 3D feature space, you can see that for each point $x$, KNN effectively claims: "the decision region around $x$ belongs to whichever class the majority of its $k$ neighbors from the training data represent." The resulting boundary can be highly irregular and sensitive to noise if $k$ is small, or overly smooth (missing local details) if $k$ is large.

[Figure: an illustrative 2D example of a KNN decision boundary, where each region is colored according to its predicted class.]
Choosing the optimal value of k
Selecting $k$ is crucial:
- $k = 1$ can perfectly classify the training data but is prone to overfitting and very sensitive to noise.
- Larger $k$ reduces variance but can increase bias.
- There is often a sweet spot that balances the two extremes.
In practice, one typically performs cross-validation (e.g., a grid search over candidate values of $k$) to find the value that yields the best average validation accuracy.
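As a sketch, a cross-validated sweep over odd values of $k$ with scikit-learn might look like the following (the Iris dataset here is just a placeholder for your own X and y):
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # placeholder dataset for illustration

scores = {}
for k in range(1, 32, 2):  # odd k from 1 to 31
    knn = KNeighborsClassifier(n_neighbors=k)
    # Five-fold cross-validated accuracy for this value of k.
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best k:", best_k, "with CV accuracy:", round(scores[best_k], 3))
```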
Weighted KNN for improved accuracy
One refinement is to weight each neighbor's contribution by a function of distance, so that closer neighbors have a stronger vote. A simple approach is:
$$w_i = \frac{1}{d(x, x_{(i)}) + \epsilon},$$
where $d$ is the chosen metric, $x_{(i)}$ is the $i$-th nearest neighbor of $x$, and $\epsilon$ is a small constant to avoid dividing by zero. More sophisticated approaches include kernels like the Gaussian, tricube, or other smoothing kernels, reminiscent of Parzen window estimators.
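In scikit-learn, weights='distance' gives you inverse-distance weighting out of the box, but you can also pass a custom callable. Here is a sketch of the simple scheme above (the function name and the epsilon value are my own choices):
```python
from sklearn.neighbors import KNeighborsClassifier

def inverse_distance(distances, epsilon=1e-8):
    """Weight each neighbor by 1 / (distance + epsilon)."""
    return 1.0 / (distances + epsilon)

# The callable receives an array of neighbor distances and must return
# an array of the same shape containing the corresponding weights.
knn = KNeighborsClassifier(n_neighbors=7, weights=inverse_distance)
# knn.fit(X_train, y_train)
```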
Handling class imbalances
KNN on imbalanced data sets can be dominated by the majority class. Some strategies to address this:
- Resampling the dataset (e.g., oversampling the minority class, undersampling the majority class).
- Applying class-dependent weighting to the KNN distances or to the final decision function.
- Combining KNN with anomaly or outlier detection so minority points do not get "washed out" by the majority.
Sometimes, SMOTE or other synthetic oversampling can help correct severe imbalance prior to fitting a KNN-based classifier.
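For example, assuming the separate imbalanced-learn package is available, SMOTE can be chained before KNN in a pipeline so that the synthetic oversampling is applied only to the training folds:
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Scale features, oversample the minority class, then fit KNN on the balanced data.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
# pipeline.fit(X_train, y_train)
# y_pred = pipeline.predict(X_test)
```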
Regression with KNN
KNN extends naturally to regression by considering the real-valued outputs of the neighbors and combining them, typically via a (possibly weighted) average:
$$\hat{y}(x) = \frac{\sum_{i=1}^{k} w_i \, y_{(i)}}{\sum_{i=1}^{k} w_i}.$$
Averaging approaches
A straightforward approach is to take the simple mean of the neighbors' numeric outputs:
$$\hat{y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i,$$
where $N_k(x)$ is the set of the $k$ nearest neighbors of $x$. This yields a local, non-parametric estimate that can adapt to quite complicated data relationships.
Weighted methods for regression
Analogous to the classification setting, you can weight each neighbor's contribution by a function of distance. A common kernel-based approach is:
$$\hat{y}(x) = \frac{\sum_{x_i \in N_k(x)} K\big(d(x, x_i)\big)\, y_i}{\sum_{x_i \in N_k(x)} K\big(d(x, x_i)\big)}.$$
In practice, you might define $K(d) = \exp\!\big(-d^2 / (2\sigma^2)\big)$ (a Gaussian kernel) or some other shape. Weighted averaging can substantially enhance performance when the data's underlying relationship is fairly smooth and local neighborhoods are meaningful.
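Distance-weighted KNN regression is available directly in scikit-learn; below is a minimal sketch on synthetic one-dimensional data (the noisy sine curve is just a stand-in target):
```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)  # noisy sine target

# weights="distance" makes closer neighbors dominate the local average.
reg = KNeighborsRegressor(n_neighbors=7, weights="distance")
reg.fit(X, y)
print(reg.predict([[2.5]]))  # local, non-parametric estimate near x = 2.5
```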
Advanced techniques
Dimensionality reduction and feature selection
A recurring difficulty in KNN is the curse of dimensionality. Distances in very high-dimensional spaces can be misleading (points tend to appear equidistant), making the notion of "nearest" less useful. Common strategies:
- Dimensionality reduction (PCA, t-SNE, or more advanced methods) to project onto a lower-dimensional subspace where the data is more compact (see the sketch after this list).
- Feature selection: systematically remove features that do not help discriminate or that add noise. In a real pipeline, you might apply domain knowledge or use wrapper/filter methods to identify a subset of meaningful features.
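As a sketch of the first strategy, here is a PCA-then-KNN pipeline in scikit-learn (the number of components, 20, is an arbitrary placeholder that you would tune):
```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, project onto a lower-dimensional subspace, then run KNN there.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=20)),  # placeholder value; tune via cross-validation
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
# pipeline.fit(X_train, y_train)
# pipeline.score(X_test, y_test)
```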
Dealing with high-dimensional data
When you cannot easily reduce dimensionality, you may turn to specialized metrics or learned embeddings that better reflect the notion of similarity for your domain. For instance, in computer vision tasks (like face recognition), networks such as FaceNet or ArcFace learn to embed images into a space where Euclidean distance correlates strongly with identity. Then, performing a KNN search in that space can work surprisingly well, even though the raw pixel dimension is extremely large.
Approximate nearest neighbor searches
Because naive KNN queries require comparing the test point to all training points (i.e., $O(m)$ distance computations per query for a training set of size $m$), large-scale usage can be prohibitive. This has motivated advanced data structures and algorithms that efficiently retrieve approximate neighbors:
- k-d trees
- Ball trees
- Product quantization (PQ)
- Inverted file indices
- Hierarchical Navigable Small World (HNSW)
Approximate nearest neighbor (ANN) methods trade off a small accuracy penalty for substantial speed boosts. They are critical in real-time systems such as image retrieval, recommendation, or large-scale clustering.
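Note that scikit-learn's NearestNeighbors already supports exact tree-based indexing (k-d trees and ball trees), which can be far faster than brute force in low to moderate dimensions. A minimal sketch with random data:
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
data = rng.random_sample((10000, 8))  # low-dimensional data suits tree indexes

# Build a ball-tree index once, then answer queries against it.
index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(data)
distances, indices = index.kneighbors(rng.random_sample((3, 8)))
print(indices.shape)  # (3, 5): five neighbors for each of three query points
```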
Hierarchical navigable small world (HNSW)
The HNSW data structure (Malkov and Yashunin, 2018) is a sophisticated graph-based ANN technique. It organizes data points in a multi-layer "small-world" graph where edges provide both short-range and random long-range connections, ensuring that queries can be routed quickly to relevant regions of the graph. The search proceeds greedily layer by layer, drastically reducing search times compared to naive exhaustive scans.
The overall structure is layered as follows:
- Level 0 (lowest) contains all points.
- Each higher level is a sparser subset of points, randomly chosen.
- Searching begins at the top layer with a random entry point, then descends layer by layer, greedily moving closer to the query vector. On the final layer, it gathers candidate neighbors for a more refined local search.
The net effect is near logarithmic complexity in many practical cases, enabling KNN queries on extremely large data sets (hundreds of millions or even billions of points) with feasible latency.
Code snippet (example in Python with hnswlib)

[Figure: a hierarchical navigable small world structure for approximate nearest neighbor search.]
Below is an illustrative snippet using the hnswlib library in Python:
```python
import hnswlib
import numpy as np

# Dimensionality of the vectors.
dim = 128
# Number of elements (example).
num_elements = 10000

# Create random data.
data = np.float32(np.random.random((num_elements, dim)))
labels = np.arange(num_elements)

# Instantiate an HNSW index in L2 (Euclidean) space.
p = hnswlib.Index(space='l2', dim=dim)

# Initialize the index.
p.init_index(max_elements=num_elements, ef_construction=200, M=16)

# Add the data.
p.add_items(data, labels)

# Set the ef parameter for queries (controls recall/quality vs. speed).
p.set_ef(50)  # ef must be >= k

# Query for the nearest neighbor of every stored point.
neighbors, distances = p.knn_query(data, k=1)

print("Approximate neighbors shape:", neighbors.shape)
print("Distances shape:", distances.shape)
```
This snippet illustrates how easy it is to build an HNSW graph structure and query neighbors with library support. In production, you might tweak parameters such as M, ef_construction, and ef to balance performance and accuracy.
Optimizing KNN for scalability
Beyond approximate methods, additional strategies to make KNN scalable include:
- Data compression or prototype selection: Removing redundant points, or representing clusters of points by their centroids (or medoids); see the sketch after this list.
- Specialized hardware acceleration: Exploit GPU or vectorized instructions to speed up distance calculations.
- Distributed or parallel indexing: For truly massive data sets, frameworks like Apache Spark or HPC libraries can parallelize neighbor searches.
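As a sketch of the prototype-selection idea, one can replace the raw training set with per-class k-means centroids and run KNN over those prototypes (the number of centroids per class is an arbitrary choice here, and each class must contain at least that many samples):
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def compress_to_prototypes(X, y, centroids_per_class=50):
    """Replace each class with k-means centroids to shrink the reference set."""
    proto_X, proto_y = [], []
    for label in np.unique(y):
        kmeans = KMeans(n_clusters=centroids_per_class, n_init=10, random_state=0)
        kmeans.fit(X[y == label])
        proto_X.append(kmeans.cluster_centers_)
        proto_y.append(np.full(centroids_per_class, label))
    return np.vstack(proto_X), np.concatenate(proto_y)

# proto_X, proto_y = compress_to_prototypes(X_train, y_train)
# KNeighborsClassifier(n_neighbors=5).fit(proto_X, proto_y)
```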
Practical considerations
Data preprocessing (normalization, missing values)
Data preprocessing is essential with KNN. Common guidelines:
- Normalization: Transform each feature so that they all lie in similar ranges (e.g., 0 to 1 or $-1$ to $1$), or so they have zero mean and unit variance. This prevents large-valued features from overpowering the distance measure.
- Handling missing data: The distance metric must be defined carefully. Some practitioners impute missing values (e.g., mean imputation), while more advanced approaches skip distance computations on missing dimensions or infer them in a more sophisticated manner. A combined preprocessing sketch follows this list.
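One convenient way to wire these steps together is a scikit-learn pipeline with mean imputation followed by standardization (a sketch; your imputation strategy may well differ, and scikit-learn's KNNImputer is another option):
```python
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute missing values, put features on a common scale, then apply KNN.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
# pipeline.fit(X_train, y_train)
```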
Memory usage and computational complexity
The classic (exact) KNN requires:
- Storing all training data (memory overhead).
- A query time of $O(m)$ distance computations per test sample, where $m$ is the training set size.
As $m$ grows large, naive KNN quickly becomes impractical. Even for moderate $m$, repeated queries can be expensive. Approximate methods or advanced data structures become critical.
Parameter tuning and cross-validation
KNN has relatively few hyperparameters:
- $k$ — the number of neighbors
- $p$ in the Minkowski distance (if relevant)
- Possibly weighting or kernel function parameters
A typical best practice is to run cross-validation (e.g., five-fold or ten-fold) across a grid of possible $k$ values (like 1, 3, 5, 7, ..., 31), as well as exploring different distance metrics or weighting schemes. You pick the combination that yields the highest validation accuracy (in classification) or the lowest MSE/MAE (in regression).
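Concretely, a grid search over $k$, the weighting scheme, and the Minkowski exponent $p$ might look like this sketch (the grid values are illustrative):
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": list(range(1, 32, 2)),  # k = 1, 3, ..., 31
    "weights": ["uniform", "distance"],
    "p": [1, 2],                           # Manhattan vs. Euclidean
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train_scaled, y_train)
# print(search.best_params_, search.best_score_)
```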
Handling noise and outliers
KNN can be very sensitive to noisy or mislabeled points, especially if $k$ is small. Some common practices:
- Filtering or cleaning suspicious training points with domain knowledge.
- Using STOLP or other "prototype selection" methods to remove outliers that degrade classification.
- Weighted KNN can reduce outlier impact by diminishing influence with distance.
Model interpretability and explainability
KNN is sometimes praised for interpretability, at least from a "reasoning by analogy" perspective: "We predict the label of $x$ to be Y because its nearest neighbors are labeled Y." That said, with a huge training set, it is less trivial to interpret or visualize the actual boundaries. The local explanation, however, remains straightforward: you can show examples in the training data that are close in distance to the new data point.
Implementation
Below is a short example in Python (using scikit-learn) for KNN classification. We assume you have data X (features) and y (labels), plus any needed preprocessing.
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Suppose X is your feature matrix of shape (m, n)
# and y is a vector of class labels of shape (m,).
# Example: you might load them from a CSV or a dataset.
# X, y = ...

# Split into train/test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# It's often useful to scale the features for KNN.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create a KNN classifier.
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean', weights='distance')

# Fit on training data.
knn.fit(X_train_scaled, y_train)

# Predict on test data.
y_pred = knn.predict(X_test_scaled)

# Evaluate.
print(classification_report(y_test, y_pred))
```
This code snippet demonstrates a typical scikit-learn workflow:
- Train/test split.
- Optional but recommended feature scaling (StandardScaler).
- Instantiating KNeighborsClassifier with your chosen hyperparameters (n_neighbors=5, i.e., $k = 5$, in this example).
- Evaluating predictions with standard metrics.
For KNN regression, scikit-learn provides a KNeighborsRegressor, used similarly but returning continuous outputs.
Use cases
Although KNN's naive variant can be slow for huge data sets, it remains a versatile tool in many domains. Here are some notable applications:
Image recognition and computer vision
- Face recognition is a canonical example: one can embed faces into a lower-dimensional space (e.g., 128D from FaceNet) and then use KNN to classify or retrieve the identity.
- Content-based image retrieval often relies on KNN queries in a feature embedding space.
Recommendation systems
- "User-based collaborative filtering" can be viewed as a form of KNN: find the users most similar to you, see which items they like, and recommend accordingly.
- Item-based approaches also frequently rely on KNN among item embeddings.
Anomaly detection and fraud detection
- For credit-card transactions or intrusion detection, a KNN approach can isolate points that lack close neighbors in normal regions of the feature space. Weighted or distance-based outlier detection is related to KNN.
Healthcare and diagnostics
- KNN can be straightforwardly applied to diagnosing diseases from clinical metrics, gene expression data, or radiological images.
- Careful metric design or dimensionality reduction is often needed due to the complexity of medical data.
Other industry-specific scenarios
- From retail analytics (predicting store demand by local "neighbors" of stores in feature space) to environmental modeling (predicting pollution at a site from the nearest sensor stations), KNN can be a building block.
Extra expansion: Additional theoretical insights
Because the audience is advanced, it is worth highlighting a few deeper aspects of KNN theory:
Convergence to the Bayes optimal classifier
- For large sample sizes and under certain assumptions (i.i.d. sampling, a well-defined distance measure, and $k$ growing such that $k \to \infty$ while $k/m \to 0$), the KNN classifier's error will approach the Bayes error.
- Specifically, the 1-NN classifier's asymptotic error is bounded above by twice the Bayes error, $2R^{*}$ (Cover and Hart, 1967).
Cover trees, vantage-point trees, and other data structures
- Beyond classic k-d trees, many specialized data structures exist to accelerate nearest neighbor searches in metric spaces with certain properties.
Prototype selection
- Reducing the size of the training set to store only the most "informative" points can speed up inference dramatically. Algorithms like Condensed Nearest Neighbor (CNN), Edited Nearest Neighbor (ENN), and STOLP systematically remove outliers or redundant points.
Relation to kernel density estimation
- Weighted KNN with a kernel that depends on distance is conceptually related to non-parametric density estimation. This close connection means you can interpret KNN in a probabilistic sense if you assume certain forms of local densities.
Extremely extended discussion on HNSW and large-scale scenarios
Because large-scale nearest neighbor queries are a big part of modern data science, we add more detail on Hierarchical Navigable Small World (HNSW):
Main idea
- Build a layered graph. The bottom layer (level 0) includes all points.
- Points also appear in higher layers with some probability that decays exponentially with the level.
- Each point in a layer is connected to a limited number of neighbors within that layer, using a bounding criterion on the number of edges (the M parameter in the snippet above).
Searching in HNSW
When you query a new point $q$:
- Start from a random entry point in the top layer.
- Greedily move to the neighbor that is closer to $q$ than your current node, until you can no longer improve.
- Descend to the layer below at that node, repeat the greedy search, now with possibly more candidates.
- Upon reaching the bottom layer, you carry out a final local search for the nearest neighbors.
Empirical results show that this yields extremely fast searches, with the "small world" property ensuring you do not get stuck in local minima too easily. The probability-based layering and the random links provide near-logarithmic complexity on many real data sets, plus insertion or deletion of points is relatively efficient.
Real-world example: Face recognition at scale
Consider a social network with an enormous number of user images. Suppose each user's face is embedded into a $d$-dimensional vector space (e.g., 128D, as with FaceNet-style embeddings). We can build an HNSW structure to store all those face embeddings. Then, for a new face, we:
- Compute its embedding vector (using a deep CNN).
- Perform an approximate KNN query in the HNSW to find the top few candidates.
- If the same user consistently appears in the top results, we predict that this face belongs to that user.
This approach is used in production systems at internet scale (Malkov and Yashunin, 2018).
Final remarks
The k-nearest neighbors technique, while ancient in terms of machine learning history, remains relevant thanks to modern hardware, advanced indexing, and embedding-based distance metrics. It stands at the intersection of simplicity and potential complexity, bridging purely local, example-based reasoning with sophisticated large-scale system designs.
KNN is conceptually easy to grasp, simple to implement, and surprisingly powerful with the right distance metric or feature embedding. Yet it is crucial to remember the computational burdens, the curse of dimensionality, and the careful engineering steps required for real-world, large-scale usage. By combining dimensionality reduction, approximate nearest neighbor techniques, and careful hyperparameter tuning, practitioners can push KNN well beyond its naive baseline and achieve state-of-the-art performance in many specialized tasks.
References and suggested reading
- Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
- Malkov, Yu. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv:1603.09320.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. (A standard text for advanced machine learning and statistics, with coverage of KNN.)
- Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research. (Describes LMNN for learning metrics that improve KNN classification.)
- FaceNet: Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In CVPR.
This concludes our comprehensive treatment of the k-nearest neighbors algorithm — from the fundamental concept, through theoretical guarantees, to advanced approximate search structures such as HNSW, and on to practical considerations for real-world deployment.