BERT model
Our first transformer

⌛  ~1 h 🤓  Intermediate
23.09.2023 · #73

This post is a part of the Transformers educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; the order in which they appear in Research can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!


Transformer-based architectures have significantly reshaped the landscape of natural language processing (NLP). Among these models, BERT (which stands for Bidirectional Encoder Representations from Transformers) has proven to be a milestone in helping machines understand and generate human language more effectively than many preceding approaches. BERT, first proposed by Devlin et al. (2018), represented a paradigm shift: instead of reading text strictly from left to right or from right to left (like older language modeling approaches), it introduced a deeply bidirectional mechanism that allows the model to learn contextual representations by integrating information from both directions simultaneously. This characteristic not only improved performance across a broad suite of NLP tasks but also spurred a wave of follow-up research into more powerful, efficient, and specialized Transformer-based models.

Why did BERT resonate so strongly throughout the AI community? Much of its success can be attributed to two core strengths:

  1. Bidirectional encoding, which, unlike traditional unidirectional language models, can see the entire context when predicting or reasoning about any single token in a sentence.
  2. Its flexibility and fine-tuning simplicity, allowing the same pre-trained BERT backbone to excel on a variety of NLP tasks (e.g., sentiment classification, named entity recognition, question answering) with minimal modifications.

In earlier unidirectional language models, the system would encode context information from either left to right or right to left. For instance, if the language model reads text left to right, it only incorporates the words that came before the current position; thus, context from upcoming words is missing. BERT instead uses a masked language modeling objective to enable a bidirectional comprehension of text during pre-training, leading to richer language representations. These representations capture nuanced semantic and syntactic relationships within text, ultimately allowing BERT to achieve state-of-the-art (SOTA) or near-SOTA results on many language benchmarks and tasks soon after its release.

In this article, I dive deeply into BERT's architecture, pre-training objectives, fine-tuning procedures, and practical applications. I will also touch on best practices, pitfalls, and advanced topics that practitioners often encounter when applying or modifying BERT in real-world or research-based scenarios. Along the way, I will provide expanded explanations, code snippets, and references to advanced findings that build upon the original BERT model. Because BERT is built on top of the Transformer encoder design, I will start by briefly revisiting the essential details of the Transformer framework to keep the discussion self-contained. After that, we will take a closer look at the specific features that make BERT so transformative in the realm of NLP.

Revisiting core concepts of transformers from the previous article

Transformers introduced a novel approach to sequence transduction in NLP. In essence, Transformers forego traditional recurrent neural network (RNN) architectures and their sequential bottleneck for computing hidden states. Instead, Transformers utilize self-attention mechanisms, allowing each token in a sequence to attend to every other token simultaneously. This design leads to significantly parallelizable computations and, more importantly, to the emergence of contextually informed representations at each layer.

The core building blocks of a Transformer encoder (which is the part BERT predominantly uses) are:

  1. Multi-head self-attention:

    • This mechanism enables the model to project queries, keys, and values (representations of tokens) into multiple subspaces.
    • Each subspace can learn different aspects of the relationship among tokens.
    • The outputs from these multiple "heads" are then concatenated and linearly transformed to form the final representation.
  2. Feed-forward network:

    • After the multi-head attention layer, each token representation flows into a position-wise feed-forward network, typically a two-layer fully connected network with an activation function in between.
    • The feed-forward component helps enrich the model's representational capacity and allows for more non-linear transformations.
  3. Layer normalization and residual connections:

    • The Transformer architecture leverages residual (skip) connections around the sub-layers, plus layer normalization, which stabilizes training and speeds up convergence.
  4. Positional encoding:

    • Because Transformers do not process tokens in a strictly sequential manner (as RNNs do), they require a way to introduce positional information.
    • Typically, this positional information can be sinusoidal or learned embeddings that represent the index of each token in the sequence, so the model has some sense of ordering.

BERT specifically adopts only the encoder portion of the Transformer blocks. There is no explicit decoder portion (commonly used for language generation in tasks like machine translation) in standard BERT. By taking advantage of the Transformer encoder's power to model contextual relationships in a bidirectional fashion, BERT can produce highly contextualized word/token embeddings at each layer, culminating in robust language representations.

BERT architecture overview

BERT is essentially a deeply stacked series of Transformer encoders: for BERTbase, there are 12 Transformer encoder layers, whereas BERTlarge has 24. Each layer houses a multi-head attention module and a feed-forward sublayer. However, BERT also comes with additional modifications to effectively handle input data and produce specialized outputs:

Input representation in BERT

One of the distinctive aspects of BERT is how it encodes input sequences before they are fed into the Transformer layers. BERT's input embeddings combine three different types of embeddings:

  1. Token embeddings:

    • These are similar to the word embeddings used in many NLP models, but in BERT, subword units are used (WordPiece).
    • Each token in the input sequence (e.g., a subword) is mapped to a learned, continuous vector.
  2. Segment embeddings:

    • BERT can handle either a single sequence of text or a pair of sequences.
    • For tasks like next sentence prediction (NSP), or question answering where there is a pair of sentences or sentence + question context, BERT uses segment embeddings (sometimes referred to as sentence A embeddings and sentence B embeddings) to distinguish which tokens belong to which sequence.
    • This embedding is a learned vector that is added to the token embedding.
  3. Positional embeddings:

    • As the Transformer architecture needs positional information, BERT uses learned positional embeddings (in contrast to the sinusoidal approach used in the original Transformer paper by Vaswani et al.).
    • Each position in the sequence has a unique embedding that is added to both the token and segment embeddings.

The final embedding for each token position is the sum TokenEmbedding + SegmentEmbedding + PositionEmbedding. These embeddings, after summation, feed into the first layer of BERT's Transformer encoder stack, initiating the forward pass.
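To make this summation concrete, below is a minimal PyTorch sketch of the embedding layer. The dimensions follow BERTbase; the class and attribute names are illustrative rather than the exact ones used by any particular library, and real implementations also apply layer normalization and dropout after the sum, as shown here.

import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    """Sketch: token + segment + position embeddings, summed and normalized."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512, type_vocab_size=2):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.segment_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_len, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids):
        # One learned position embedding per index 0..seq_len-1
        positions = torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)
        summed = (self.token_embeddings(input_ids)
                  + self.segment_embeddings(token_type_ids)
                  + self.position_embeddings(positions))
        return self.dropout(self.layer_norm(summed))

# Toy usage: one sequence of 6 token IDs, all belonging to "sentence A" (segment 0)
emb = BertStyleEmbeddings()
ids = torch.randint(0, 30522, (1, 6))
print(emb(ids, torch.zeros_like(ids)).shape)  # torch.Size([1, 6, 768])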

Special tokens ([CLS] and [SEP])

BERT introduces two special tokens in its input format:

  • [CLS]: Placed at the start of every input sequence. The hidden state corresponding to [CLS] in the final layer of BERT often serves as a representation for classification tasks.
  • [SEP]: Used to delimit sequences. This token is placed at the end of each sequence, enabling BERT to handle either one or two sequences. When there are two sequences, for example in NSP tasks, a [SEP] is placed at the boundary between the first and second sequence, and again at the end of the second sequence.

These special tokens help BERT keep track of the context for classification or subsequent tasks. For tasks like single-sentence classification, the [CLS] token's final hidden state is typically used as the input to a classification layer. For pairwise tasks, BERT can attend to both sequences and compute cross-sequence contextual encodings.
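As a quick illustration, the tokenizer from the Hugging Face Transformers library (used again in the implementation section below) shows exactly where [CLS] and [SEP] are inserted and how segment IDs distinguish the two sequences; the printed tokenizations are indicative and depend on the vocabulary.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single sequence: [CLS] tokens... [SEP]
single = tokenizer("BERT uses special tokens.")
print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
# e.g. ['[CLS]', 'bert', 'uses', 'special', 'tokens', '.', '[SEP]']

# Sequence pair: [CLS] A... [SEP] B... [SEP], with token_type_ids marking segment A vs. B
pair = tokenizer("How are segments marked?", "With segment embeddings.")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
print(pair["token_type_ids"])  # 0s for the first segment, 1s for the second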

Transformer encoder blocks

Within BERT, each Transformer encoder block has the following main layers:

  1. Self-attention sublayer:

    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
    • Here, Q (query), K (key), and V (value) are projected from the previous layer's hidden states.
    • d_k is the dimensionality of the key vectors; dividing by \sqrt{d_k} scales the dot products to keep the softmax well-behaved.
    • Multi-head attention replicates these computations multiple times in parallel subspaces, capturing diverse relational patterns among tokens.
  2. Feed-forward sublayer:

    \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
    • This position-wise feed-forward network applies two linear transformations with an activation function (usually GELU or ReLU) in between.
    • The same set of parameters is applied to each position independently.
  3. Residual connection and layer normalization:

    • After each sublayer, the input to the sublayer is added to its output, forming a residual pathway.
    • Layer normalization is applied on top to stabilize gradients.

Each encoder block progressively refines contextual token representations, culminating in a deep network that can capture complex syntactic and semantic interactions across a text.
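The two formulas above translate almost directly into code. The following is a minimal, self-contained sketch of one encoder block, using single-head attention for brevity; real BERT uses multiple heads with separate projections per head, plus dropout inside each sublayer.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return F.softmax(scores, dim=-1) @ v

class SimplifiedEncoderBlock(nn.Module):
    def __init__(self, hidden=768, intermediate=3072):
        super().__init__()
        self.w_q = nn.Linear(hidden, hidden)
        self.w_k = nn.Linear(hidden, hidden)
        self.w_v = nn.Linear(hidden, hidden)
        self.w_o = nn.Linear(hidden, hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, intermediate), nn.GELU(), nn.Linear(intermediate, hidden))
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):
        attn = self.w_o(scaled_dot_product_attention(self.w_q(x), self.w_k(x), self.w_v(x)))
        x = self.norm1(x + attn)          # residual connection + layer norm around attention
        x = self.norm2(x + self.ffn(x))   # residual connection + layer norm around the FFN
        return x

x = torch.randn(1, 6, 768)                # (batch, sequence length, hidden size)
print(SimplifiedEncoderBlock()(x).shape)  # torch.Size([1, 6, 768])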

Output layers and representations

After passing through all Transformer encoder layers, BERT produces a contextual embedding for each position in the input sequence. In particular:

  • The output corresponding to [CLS] at the final layer is often used for classification tasks (e.g., sentiment classification, next-sentence prediction).
  • The output corresponding to each token (including subwords) can be utilized for token-level predictions (e.g., named entity recognition) or for tasks requiring an understanding of each word's context.
  • Intermediate layers can also be used. Sometimes, researchers or practitioners extract embeddings from earlier layers if needed for certain tasks or if they find better generalization in mid-layer representations.

These final or intermediate hidden states are subsequently fed into a task-specific head — for instance, a linear layer for classification or a specialized architecture for question answering. Despite the variety of tasks, BERT's multi-task potential is made possible by the universal nature of its deep bidirectional language representations learned during pre-training.

Pre-training objectives

The real power of BERT is most apparent when we examine how it is initially trained at scale. BERT uses large unlabeled text corpora and learns robust language representations by optimizing two main objectives. Through these tasks, BERT acquires general linguistic and semantic knowledge that can later be swiftly adapted to downstream tasks with minimal changes.

Masked language modeling (MLM)

Perhaps the key innovation in BERT is its usage of masked language modeling. The goal here is for the model to predict randomly masked tokens within the input. Traditionally, language models read text from left to right and try to predict the next word given the previous ones. That constraint prevents the model from using the entire bidirectional context. MLM relaxes that constraint:

  • A certain percentage (typically 15%) of the input tokens are selected for prediction; of these, 80% are replaced with a [MASK] token, 10% with a random token, and 10% are left unchanged.
  • The model attempts to reconstruct the original token in these masked positions, thereby effectively learning to attend to the entire left and right context.

Concretely, the MLM objective can be represented as optimizing the log-likelihood of the masked positions. If x is the input token sequence and M is the set of masked positions, the model maximizes:

\sum_{i \in M} \log p(x_i \mid x_{\setminus M})

where x_{\setminus M} denotes the input sequence with the masked tokens excluded or replaced. By training on this objective, BERT learns to leverage contextual clues from both sides of each masked token.
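As a sketch of how the masking recipe above is typically implemented (roughly what data collators in common libraries do), the function below corrupts a batch of token IDs and builds the labels. The mask_token_id and vocab_size arguments are generic, and in practice special tokens like [CLS] and [SEP] would be excluded from masking.

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Return (corrupted_input_ids, labels); labels are -100 where no prediction is required."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of positions for prediction
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # ignored by PyTorch's cross-entropy loss

    # 80% of the selected positions become [MASK]
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # 10% become a random token (half of the remaining selected positions)
    random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[random] = torch.randint(vocab_size, labels.shape)[random]

    # The remaining 10% keep the original token but still need to be predicted
    return input_ids, labels

ids = torch.randint(1000, 30000, (2, 12))
corrupted, labels = mask_tokens(ids, mask_token_id=103, vocab_size=30522)  # 103 is [MASK] in bert-base-uncased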

Next sentence prediction (NSP)

In addition to understanding individual token contexts, BERT also tries to learn relationships between sentences. The second core pre-training task is next sentence prediction (NSP):

  • The model takes pairs of sentences (S_1, S_2) as input. 50% of the time, S_2 is the actual subsequent sentence of S_1 in the original text. For the other 50%, a random sentence is sampled from the corpus.
  • BERT is trained to predict whether S_2 is the real next sentence (IsNext) or a random sentence (NotNext).

This objective helps BERT capture inter-sentence coherence, which is crucial for tasks like question answering or natural language inference, where deeper understanding of multi-sentence contexts is pivotal. The [CLS] token's final hidden state is typically used to perform this NSP classification.
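For illustration, the pre-trained NSP head can be queried directly via the Hugging Face BertForNextSentencePrediction class; in that implementation, logit index 0 corresponds to IsNext and index 1 to NotNext.

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

s1 = "The weather was terrible this morning."
s2 = "So we decided to stay indoors."          # plausibly the next sentence

inputs = tokenizer(s1, s2, return_tensors="pt")  # builds [CLS] s1 [SEP] s2 [SEP] with segment IDs
with torch.no_grad():
    logits = model(**inputs).logits              # shape (1, 2)

probs = torch.softmax(logits, dim=-1)
print("P(IsNext) =", probs[0, 0].item())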

Role of large-scale corpora in pre-training

BERT is typically trained on massive datasets, such as the English Wikipedia (2.5B words) and the BookCorpus (800M words). The breadth and size of these datasets provide diverse language patterns, enabling BERT to generalize to varied text-based tasks. This large-scale unsupervised pre-training is akin to giving BERT a thorough language education before fine-tuning it on any specific tasks.

Alternative objectives and techniques

Since BERT's debut, multiple studies have explored alternative or improved pre-training objectives:

  • Whole-word masking: Instead of masking subword tokens at random, entire words are masked, which might help the model learn better word-level semantics.
  • Permutation language modeling: Introduced by XLNet (Yang et al., 2019), it combines the bidirectional context idea with a permutation-based autoregressive formulation.
  • Span masking: T5 (Raffel et al.) masks contiguous spans of text and trains the model to recover them. This approach can better reflect the correlation between consecutive tokens.
  • NSP removal and sentence order prediction: RoBERTa (Liu et al., 2019) removes NSP entirely, focusing on a pure masked language modeling approach with more data and more training steps, while ALBERT replaces NSP with a sentence order prediction task that uses a different negative-sampling strategy.

These innovations highlight the ongoing evolution of pre-training methods. Nonetheless, the original MLM+NSP approach remains widely recognized as BERT's signature method.

Fine-tuning procedures

After pre-training, BERT can be applied to a wide variety of downstream tasks. Because BERT is intended to be fine-tuned — not used purely as a fixed feature extractor — the process typically involves adding minimal new parameters for the specific task and resuming gradient-based training on labeled data.

Task-specific heads

When fine-tuning BERT, practitioners usually add a classification or token-level classification layer on top of the final hidden states. For instance (a sketch of a classification head follows this list):

  • Classification tasks: A linear layer is placed on top of the [CLS] token. This setup is used for sentiment analysis, textual entailment, and other tasks needing a single label.
  • Token-level classification tasks: Tasks like named entity recognition (NER) or part-of-speech tagging require a label for each token, so the final hidden state of each token is fed into a linear layer.
  • Question answering tasks: Models predict the start and end token in a passage that answers a question. Hence, two linear output layers map each token's final representation to a probability distribution over start positions and over end positions, respectively.
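As a concrete sketch of the first case, here is a single-label classification head placed on the final [CLS] hidden state of the Hugging Face BertModel; in practice BertForSequenceClassification already packages this pattern.

import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    """Final [CLS] hidden state -> dropout -> linear layer -> class logits."""
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls_hidden = outputs.last_hidden_state[:, 0]  # hidden state at the [CLS] position
        return self.classifier(self.dropout(cls_hidden))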

Common tasks

Because of its success on the GLUE benchmark and SQuAD, BERT has become a staple approach for typical NLP tasks:

  • Classification (sentiment analysis, spam detection, etc.)
  • Question answering (SQuAD, TriviaQA, etc.)
  • Named entity recognition (annotating tokens for categories such as Person, Location, Organization)
  • Natural language inference (entailment, contradiction, neutral classification for sentence pairs)
  • Text summarization and paraphrasing (with additional modifications or decoders)

Hyperparameter considerations

When fine-tuning BERT, careful hyperparameter selection can significantly impact performance and convergence. Key factors include (an optimizer and scheduler sketch follows this list):

  • Learning rate: Typical ranges are between 2e-5 and 5e-5 for many tasks, but a small difference can drastically alter outcomes.
  • Batch size: Larger batch sizes can stabilize gradient estimates but also require more memory. Common batch sizes range from 16 to 64.
  • Warm-up steps: BERT often benefits from a linear warm-up schedule for the learning rate, transitioning from near zero to the target learning rate over a portion (e.g., 10%) of the training steps.
  • Epochs: Fine-tuning typically ranges from 2 to 4 epochs for many classification tasks, though more complex tasks or smaller datasets might require more.
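Putting these choices together, a typical fine-tuning configuration with AdamW and a linear warm-up schedule might look like the sketch below; the step counts are illustrative, and get_linear_schedule_with_warmup comes from the Transformers library.

from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

num_epochs = 3
steps_per_epoch = 1000                        # illustrative; normally len(train_dataloader)
total_steps = num_epochs * steps_per_epoch

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warm-up over the first 10% of steps
    num_training_steps=total_steps,           # then linear decay to zero
)
# During training, call optimizer.step() followed by scheduler.step() once per batch.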

Best practices for regularization

To prevent overfitting, especially on smaller labeled datasets, you can use:

  • Dropout: Not only in the feed-forward and attention sublayers (as in the Transformer architecture) but also in final classification heads. Common dropout rates are 0.1 or 0.2.
  • Early stopping: Monitoring development set performance to stop training when improvements stagnate.
  • Data augmentation: Paraphrasing, back translation, or entity replacement can artificially increase the size of your dataset.
  • Layer-wise learning rate decay: Lower learning rates for earlier layers, since these layers are already well-trained from the pre-training stage, and slightly higher learning rates for the later, more task-specific layers (see the sketch after this list).
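Layer-wise learning rate decay is usually implemented with per-layer parameter groups. A minimal sketch, assuming the Hugging Face parameter layout (bert.embeddings, bert.encoder.layer) and an illustrative decay factor of 0.95 per layer (the pooler is omitted for brevity):

from torch.optim import AdamW
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

base_lr, decay = 2e-5, 0.95
num_layers = model.config.num_hidden_layers            # 12 for BERTbase

param_groups = [{"params": model.classifier.parameters(), "lr": base_lr}]
for i, layer in enumerate(model.bert.encoder.layer):
    # deeper layers (larger i) get learning rates closer to base_lr
    param_groups.append({"params": layer.parameters(), "lr": base_lr * decay ** (num_layers - i)})
param_groups.append({"params": model.bert.embeddings.parameters(),
                     "lr": base_lr * decay ** (num_layers + 1)})

optimizer = AdamW(param_groups)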

Practical applications of BERT

BERT's power is demonstrated in how easily it can be applied to diverse NLP tasks. Below are some key areas where it has delivered substantial value:

Natural language inference

BERT can be used to classify pairs of sentences into categories like "entailment", "contradiction", or "neutral". This ability arises from the next-sentence prediction pre-training. In practice, the two sentences are fed into BERT as a single sequence separated by the [SEP] token, with the [CLS] token at the start. The final hidden state of [CLS] is then used for classification.

Sentiment analysis

Whether for binary (positive/negative) classification or more nuanced multi-class sentiment tasks, BERT's final [CLS] representation or the average pooling of token representations can be fed to a classification layer. BERT's contextual embeddings capture subtle language cues, sarcasm, and domain-specific language better than more naive approaches.

Named entity recognition

Many organizations and researchers employ BERT to label tokens in text with categories like Person (PER), Location (LOC), or Organization (ORG). Each final hidden state corresponding to a token is mapped to a label. The deep bidirectional context helps BERT disambiguate tricky entity boundaries and names with multiple possible types.

Question answering systems

In tasks like SQuAD (Stanford Question Answering Dataset), a question and a passage are concatenated with [CLS] and [SEP] tokens. BERT learns to identify the start and end positions of the answer within the passage. Thanks to BERT's deep contextual understanding, it can pinpoint answer spans even if paraphrased from the original question.

Text summarization and paraphrasing

While BERT by itself is not typically a generative model, it can serve as the encoder component in encoder–decoder architectures for summarization or paraphrase generation. Some approaches adapt BERT's output for extractive summarization (selecting key sentences) or incorporate additional decoders for abstractive summarization.

These practical applications underscore how broadly BERT can be leveraged with minimal architectural adjustments.

Implementation

To illustrate BERT's usage in real systems, let's walk through a typical BERT implementation pipeline. We will explore:

  1. Step-by-step outline of how to set up BERT in a framework like PyTorch.
  2. How to integrate BERT for downstream tasks.
  3. Key configuration parameters.
  4. Memory and computational considerations, plus advanced optimization strategies for large models.

Implementing BERT step-by-step

The following conceptual workflow outlines how you might implement BERT or load a pre-trained version for fine-tuning:

  1. Install dependencies:

    • Python 3.7+, PyTorch (or TensorFlow), Transformers library by Hugging Face (or another library such as Fairseq).
  2. Load a pre-trained BERT model (e.g., bert-base-uncased) from a library hub:

    
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification
    
    # Load pre-trained tokenizer
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    
    # Load pre-trained model, specifically for sequence classification tasks
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    
  3. Prepare your data:

    • Tokenize input sentences (or sentence pairs) using the BERT tokenizer.
    • Convert tokens to input IDs, segment IDs, and attention masks.
  4. Set up the training loop (a condensed loop is sketched after this list):

    • Choose your loss function (often cross-entropy) and an optimizer like AdamW with weight decay.
    • Decide on a learning rate schedule with optional warm-up.
    • Create batches of your data and feed them to the model, computing the forward pass, loss, and backpropagation.
  5. Evaluate on a validation set:

    • Keep track of your chosen metrics (accuracy, F1, etc.).
    • Adjust hyperparameters as needed.

BERT integration and inference (code)

As an example for a binary text classification task (say, sentiment analysis), once your model is trained, usage for inference might look like:


# Suppose you have a trained model and a tokenizer
model.eval()  # switch off dropout for deterministic inference
text = "This new restaurant is fantastic!"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=-1)

# predictions[0] will give you the predicted class ID
print("Predicted class:", predictions[0].item())

In production, you might package this logic into a web service, load your fine-tuned model, and expose an API endpoint that returns a sentiment label for any given text input.

BERT can be used in both TensorFlow and PyTorch. The Hugging Face Transformers library provides a unified API for both frameworks. You can also use TensorFlow's official BERT implementation or other implementations if you prefer. PyTorch is often chosen for research and experimentation due to its pythonic imperative style and dynamic graph approach.

Model configuration

A BERT model is typically specified by the following parameters:

  • Hidden size (e.g., 768 for BERTbase)
  • Number of layers (e.g., 12 for BERTbase)
  • Number of self-attention heads (e.g., 12 for BERTbase)
  • Intermediate size in the feed-forward networks (e.g., 3072 for BERTbase)
  • Vocabulary size (e.g., 30522 for the uncased model)
  • Type (segment) vocab size (e.g., 2 for sentence A and sentence B)

For BERTlarge, the hidden size is typically 1024, with 24 layers and 16 attention heads. The trade-off is that BERTlarge can capture more nuanced patterns but requires significantly more memory and computational resources.
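In the Hugging Face library, these parameters map directly onto a BertConfig object; the values below reproduce the BERTbase defaults, and instantiating a model from the config gives randomly initialized weights with that shape.

from transformers import BertConfig, BertModel

config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    vocab_size=30522,
    type_vocab_size=2,               # segment vocabulary: sentence A / sentence B
    max_position_embeddings=512,
)

model = BertModel(config)             # random weights in the BERTbase shape
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")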

Handling large models and memory constraints

Full-scale BERT, especially BERTlarge, can exceed typical GPU memory limits if the batch size is large. Some strategies to handle this include (a sketch combining the first two follows this list):

  • Gradient checkpointing: Checkpoint intermediate activations to reduce memory usage at the cost of extra computation for backprop.
  • Mixed precision training: Use half-precision floating point (FP16) to cut down memory usage and speed up training.
  • Distributed training: Utilize multi-GPU or multi-node setups to distribute the data or model parameters.
  • Model distillation: Distill BERT into a smaller model (e.g., DistilBERT) that preserves much of the performance while being more compact.
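The first two strategies take only a few lines with the Transformers/PyTorch stack. Below is a minimal sketch combining gradient checkpointing with automatic mixed precision; the model choice, data, and loop details are assumptions, and a CUDA device is required for AMP as written.

import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-large-uncased", num_labels=2).cuda()
model.gradient_checkpointing_enable()       # trade extra compute for lower activation memory

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scaler = torch.cuda.amp.GradScaler()        # loss scaling for stable FP16 training

for batch in train_dataloader:              # assumed dataloader of tokenized batches with labels
    batch = {k: v.cuda() for k, v in batch.items()}
    with torch.cuda.amp.autocast():         # forward pass in mixed precision
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()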

Tokenization approaches

BERT's original implementation uses a subword approach known as WordPiece. This algorithm splits words into smaller pieces when the word is not in the vocabulary, enabling BERT to handle out-of-vocabulary (OOV) words gracefully. Other popular subword tokenizers include Byte-Pair Encoding (BPE) and SentencePiece. Regardless of the method, the goal is to handle the vast vocabulary space in a more flexible way than purely word-level tokenization.
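You can inspect the subword behavior directly with the pre-trained tokenizer; pieces that continue a word are prefixed with ##, and the exact splits depend on the learned vocabulary.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("playing"))
# a common word typically stays whole, e.g. ['playing']
print(tokenizer.tokenize("electroencephalogram"))
# a rare word is split into WordPiece units, e.g. ['electro', '##ence', ...]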

Tips, tricks, and best practices

While BERT is powerful, it can be sensitive to hyperparameters and easy to overfit if the data is limited. Here are some advanced tips:

Training stability

  • Gradient clipping can prevent exploding gradients. Typically, a maximum gradient norm of 1.0 or 5.0 is used.
  • Careful weight initialization is less of a concern if you are loading a pre-trained model, but any new layers on top (like classification heads) should be properly initialized.
  • Monitor loss carefully. If the loss becomes NaN or diverges, it might be a sign the learning rate is too high or the batch size is too small to give stable gradient estimates.

Learning rate scheduling

Schedules like linear decay or cosine decay with warm-up steps are commonly used. Warm-up helps during the early stages of fine-tuning, preventing large initial updates from disrupting the pre-trained weights.

Effective batch sizes

Because BERT was pre-trained with large batch sizes (e.g., 256+), small fine-tuning batch sizes might deviate from those conditions. You may use gradient accumulation to simulate a larger effective batch size if GPU memory is limited.
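Gradient accumulation only requires scaling the loss and delaying optimizer.step(). A minimal sketch, with model, optimizer, and train_dataloader assumed as in the implementation section:

accumulation_steps = 4                      # effective batch size = per-step batch size * 4

optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss
    (loss / accumulation_steps).backward()  # accumulate scaled gradients across small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                    # update weights once per "effective" batch
        optimizer.zero_grad()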

Regularization and dropout

Dropout in attention heads and feed-forward sublayers is typically set to 0.1. Adding dropout to the classification head can help avoid overfitting. On smaller datasets, you might increase dropout to 0.2 or 0.3.

Model interpretability

Attention visualization is a popular approach to glean how BERT focuses on certain words in a sequence. Additionally, layer probing techniques can show which layers are most responsible for capturing linguistic features like syntax, co-reference, or semantics.

If you want to illustrate attention patterns in a teaching or demonstration environment, you could generate a visualization that shows the attention weights from different heads. For example:

[Image: "Self-attention visualization" — illustration of attention weights for a sample sentence in BERT.]

This kind of visualization can help you or your audience understand how the model distributes focus across tokens.
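If you want to build such a visualization yourself, the raw attention weights can be returned by the model directly; the sketch below extracts them, and plotting is left to your tool of choice (matplotlib, bertviz, etc.).

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each of shape (batch, num_heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer0_head0 = outputs.attentions[0][0, 0]
print(tokens)
print(layer0_head0.shape)  # attention matrix over the tokenized sequence, including [CLS] and [SEP]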

Common pitfalls and troubleshooting

Despite BERT's maturity, certain pitfalls and issues still frequently arise:

Overfitting in small datasets

BERT's deep architecture can quickly overfit a small dataset, especially if you fine-tune for too many epochs. Solutions include:

  • Using a smaller learning rate or fewer epochs.
  • Employing data augmentation techniques.
  • Freezing some of the early layers of BERT so they do not update during fine-tuning (see the sketch below).
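A minimal sketch of the freezing approach, using the Hugging Face parameter layout; the embeddings and the first few encoder layers are frozen, and the number of frozen layers is illustrative.

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

freeze_layers = 6                                      # freeze embeddings + first 6 encoder layers
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:freeze_layers]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.1f}M")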

Evaluation metrics and misinterpretations

Accuracy alone might not convey the full picture, especially for imbalanced classification tasks (like NER or sentiment tasks with skewed label distributions). Consider evaluating Precision, Recall, and F1-score to get a more nuanced view. Also watch out for how your data split and cross-validation procedures could lead to inflated or deflated performance estimates.

Handling domain shifts

If your downstream domain diverges significantly from the data BERT was pre-trained on (e.g., medical text vs. general text), BERT might underperform. Domain-specific adaptation can help: re-run MLM on domain-specific unlabeled text (referred to as domain-adaptive pre-training), or use specialized models like BioBERT for biomedical texts.

Practical constraints (compute time, GPU/TPU limits)

BERT is computationally heavy, especially at the pre-training stage. Although you can fine-tune a pre-trained model on a single GPU, for large tasks or large-scale hyperparameter sweeps, you might need a multi-GPU or TPU environment. When resources are constrained, distilling BERT to a smaller network or using smaller variants (ALBERT, MobileBERT, TinyBERT) is common.

Debugging convergence issues

If training accuracy remains at chance level or the model fails to converge, check:

  • Tokenization code: Are you applying the same vocabulary and preprocessing steps as the pre-trained model?
  • Segment IDs and attention masks: Make sure these are correct for single-sentence or two-sentence tasks.
  • Learning rate or batch size: Often a suboptimal learning rate leads to training instabilities.

Throughout the last few years, BERT has served as an exemplar of how large-scale self-supervised pre-training can revolutionize the performance of language models. By adopting a deeply bidirectional Transformer encoder design and combining it with innovative pre-training tasks (MLM and NSP), BERT demonstrated that it is possible to train general-purpose language representations that adapt well to numerous tasks. As the NLP domain evolves, BERT continues to remain relevant, either in its original form or as the foundation from which countless derivative models (RoBERTa, ALBERT, DistilBERT, etc.) have emerged.

Because of its ubiquitous presence in both academic research and industry deployments, understanding BERT's underlying mechanisms, training objectives, and best practices is practically essential for anyone working deeply with modern NLP. Whether you are building a sentiment classifier, a question answering system, or a multi-turn dialogue agent, BERT offers a robust starting point. Its success also foreshadowed the wave of even more powerful language models that have since come onto the scene — but for many standard NLP tasks, BERT still provides a reliable, efficient, and effective baseline.

I encourage you to experiment with BERT-based pipelines on your own tasks, carefully tune hyperparameters, and watch how swiftly it adapts to new data. By mastering BERT fundamentals, you will be well-prepared to venture into the broader ecosystem of Transformer-based architectures and further harness the ever-expanding capabilities of large language models.
