Intro to PyTorch

Intro to PyTorch

How to stop being afraid of pip install

#️⃣   ⌛  ~1 h 🗿  Beginner

15.04.2023

upd:

#42

Intro to PyTorch

How to stop being afraid of pip install

⌛  ~1 h

#42

🎓 65/167

This post is a part of the Deep learning basics educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while it can be arbitrary in Research.

I'm also happy to announce that I've started working on standalone paid courses, so you could support my work and get cheap educational material. These courses will be of completely different quality, with more theoretical depth and niche focus, and will feature challenging projects, quizzes, exercises, video lectures and supplementary stuff. Stay tuned!

PyTorch, an open-source deep learning framework initially developed by researchers at Facebook AI Research (FAIR), has reshaped how scientists and engineers build, train, and deploy neural network models. When PyTorch was first released, it quickly gained popularity due to its dynamic computation graph mechanism and user-friendly interface that combined a Pythonic feel with high-performance automatic differentiation capabilities. This article aims to give you a thorough — yet accessible — deep dive into PyTorch, guiding you through its core concepts, major strengths, and practical applications in modern machine learning workflows.

My goal here is to empower you with theoretical and hands-on knowledge of PyTorch so that, by the time you finish reading, you will be comfortable designing and training neural networks using the framework for both simple toy problems and advanced research tasks. While PyTorch supports an impressive breadth of features, such as distributed training, quantization, mobile deployment, and more, this article will focus primarily on the building blocks essential for constructing and training neural networks. With that said, I will also hint at advanced capabilities — for instance, transfer learning or mixed precision training — as these are now considered fairly standard in cutting-edge machine learning pipelines.

In parallel, it is worth noting the research origin and context of PyTorch. The framework is built atop a C++ backend (ATen) and a Just-In-Time (JIT) compiler. It was showcased in several papers and utilized by many researchers in the machine learning community. For instance, Paszke and gang (NeurIPS 2019) introduced significant improvements in the library's performance and features such as TorchScript for production. If you think about the computational advantages conferred by using GPU-accelerated tensor operations — combined with PyTorch's hallmark dynamic graph creation approach — it is not surprising that PyTorch has become a standard tool in academic labs and at leading companies in data science and artificial intelligence.

key features and benefits of pytorch

PyTorch's primary competitive edge is its intuitive, Pythonic design. In earlier frameworks, notably older versions of TensorFlow (before the eager execution era), one had to define a static graph of computations and then run that graph in a separate session. This static approach was often cumbersome to debug and somewhat less intuitive for developers who are used to Python's imperative style.

By contrast, PyTorch constructs the graph dynamically, meaning the graph is defined "on the fly" during the forward pass. In code, this translates into a style of programming that feels closer to standard Python, with familiar control flows such as loops and conditional statements. When you write something like:


import torch

x = torch.ones(5)
y = torch.zeros(5)
z = x + y  # dynamic creation of the computational graph
print(z)

...the computational graph is built (and soon disposed of, if not needed further) with each operation call. This approach is known as "define-by-run" or dynamic computation graphs, and it allows for interactive debuggers or any real-time introspection you may want to do in your code.

Additionally, PyTorch has a built-in automatic differentiation mechanism called autograd. You simply declare $x.requires\_grad = True$ if you want to track gradients through $x$ . When your forward pass is computed, PyTorch remembers the sequence of operations and, by calling


z.backward()

in code, it can traverse that chain of operations backward, computing derivatives of

z

with respect to the original input(s).

Another crucial aspect is PyTorch's nn module, which provides the building blocks for defining complex neural networks. On top of that, the ecosystem includes a host of utilities for data loading, transformations, and GPU management, making PyTorch a self-contained environment for deep learning.

Beyond these fundamental features, PyTorch fosters a robust community and extensive documentation, including domain-specific libraries like torchvision for computer vision, torchaudio for audio processing, and torchtext for NLP tasks. Thanks to these libraries, one can perform specialized transformations, load canonical datasets, and adopt pre-trained models right out of the box, speeding up experimentation and development cycles.

difference between pytorch and tensorflow

The difference between PyTorch and TensorFlow can be contextualized along several dimensions: graph construction, user experience, deployment, and ecosystem. If you have read our other article covering TensorFlow, you already know that TensorFlow 2.x introduced "eager execution" to address the prior complexity of static graphs. This development brought TensorFlow more in line with PyTorch's dynamic approach. However, even with TensorFlow's improvements, many researchers still find PyTorch's immediate, imperative style more intuitive and more straightforward to debug.

In terms of deployment and production-scale usage, TensorFlow has historically been considered the go-to tool, boasting the more mature TensorFlow Serving and TensorFlow Lite for mobile. PyTorch has more recently added TorchServe for production endpoints and has made strong progress in bridging that gap. Meanwhile, for large-scale projects that require distributed training across massive GPU clusters, both frameworks have robust solutions, though they differ slightly in implementation details. TensorFlow uses tf.distribute strategies, whereas PyTorch provides torch.distributed and torch.nn.parallel modules.

Finally, one major difference is the developer community: PyTorch's community has thrived in academic circles, whereas TensorFlow's adoption was initially stronger in the industry. By now, though, both have become widely used in both domains. As a practical note, it really comes down to personal preference or organization-level decisions; both frameworks, if used correctly, will serve you well in advanced machine learning applications.

call to install and configure it yourself

To follow along with the examples in this article or to experiment on your own, you should install PyTorch by visiting pytorch.org and following the instructions for your operating system and CUDA version. You might choose to create a dedicated virtual environment and run a command such as


pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

if you want GPU support. You can also install the CPU-only version by removing CUDA dependencies. As you progress into deeper projects, you may find that setting up your GPU drivers and verifying GPU usage are critical steps, but we will not cover these configuration details here. Just make sure you have a consistent environment to run PyTorch examples reliably.

revisiting tensors and autograd mechanism

In your day-to-day use of PyTorch, you will be working with tensors at nearly every step. A tensor is essentially a generalization of matrices to potentially higher dimensions. For instance, a three-dimensional tensor might represent a batch of images, or a four-dimensional tensor might represent a batch of 3D video frames. The key difference from a standard $numpy$ array is that PyTorch's $torch.Tensor$ objects can perform operations on the GPU with automatic differentiation turned on if required.

Recall that once you set $requires\_grad=True$ on a tensor, PyTorch will begin tracking all operations on it. Underneath the hood, an internal structure known as a "tape" is formed, which records metadata about how each intermediate result was created. Then, when you call .backward() on the final output, PyTorch traverses this tape from the last operation back to the initial tensors, applying the chain rule to compute partial derivatives in an efficient manner.

Mathematically, if you define a function $f(x) = x^2 + 3x + 5$ with $x$ as a tensor that requires gradient, PyTorch can compute $\frac{df}{dx}$ for you automatically:


import torch

x = torch.tensor(2.0, requires_grad=True)
f = x**2 + 3*x + 5  # 4 + 6 + 5 = 15 if x=2
f.backward()        # compute gradients
print(x.grad)       # should print derivative at x=2, which is 2*x + 3 = 7

This mechanism (which is akin to a tape-based autodiff as described in numerous research contexts like Baydin and gang (JMLR 2018)) liberates you from manually coding derivative logic. You can focus on building up complex neural networks layer by layer, safe in the knowledge that your gradients are accurately computed under the hood.

building neural networks

the nn module

The torch.nn module is the cornerstone of neural network design in PyTorch. It provides an extensive set of classes and functions that facilitate defining layers, activation functions, and even sophisticated architectural components such as attention layers. The typical approach is to subclass nn.Module, which represents a trainable model. Inside your subclass, you declare the layers (e.g., linear layers, convolutional layers) in the constructor, and then implement the forward method to define how the data flows through these layers.

Conceptually, nn.Module helps organize parameters, so that you can easily inspect or update them through PyTorch's automatic gradient system. Once your custom network is defined as a subclass of nn.Module, you can drop it into a training loop, pass data to it, and let autograd handle gradient computations for you.

creating a simple feed-forward network

The simplest neural network you can build with PyTorch is often a feed-forward network or info Multi-Layer Perceptron that applies a series of linear transformations (fully connected layers) followed by nonlinear activations. Here is a conceptual example that uses nn.Sequential for brevity, though you can also define a custom class for more flexibility:


import torch
import torch.nn as nn

# Option A: Using nn.Sequential
model = nn.Sequential(
    nn.Linear(32, 64),   # input shape: 32 features
    nn.ReLU(),          # activation
    nn.Linear(64, 10)   # output shape: 10 classes
)

# Option B: Subclass nn.Module
class SimpleFeedForward(nn.Module):
    def __init__(self):
        super(SimpleFeedForward, self).__init__()
        self.fc1 = nn.Linear(32, 64)
        self.fc2 = nn.Linear(64, 10)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model_b = SimpleFeedForward()

In either approach, the essential principle is that you define the shape of your input and output, and the network is composed of these building blocks. Whenever you pass data into model, the forward pass is automatically constructed, and the required gradient logic is appended to PyTorch's internal tape. Then, when you compute your loss and call .backward(), the gradients flow back through fc2, fc1, and so on, updating all .weight and .bias parameters in each layer.

activation functions

Non-linear activation functions are critical in neural networks because they allow networks to approximate complex, non-linear mappings from inputs to outputs. The nn module (and torch functional modules) provide many standard activation functions:

nn.ReLU
nn.Sigmoid
nn.Tanh
nn.LeakyReLU
nn.Softmax
...and more.

If you recall from prior articles, an activation function like ReLU ( $ReLU(x) = \max(0, x)$ ) zeroes out negative values and keeps positive values unchanged. Sigmoid is commonly used in binary classification tasks, mapping real inputs to a range of $(0,1)$ .

From a gradient perspective, these functions each exhibit different properties. For instance, ReLU's gradient is either 1 for positive inputs or 0 for negative inputs, which can lead to the "dying ReLU" problem if many neurons become inactive. Meanwhile, the sigmoid function saturates at extremes, potentially causing issues with vanishing gradients. Understanding these quirks can help you pick the right activation function for your architecture and domain.

forward and backward passes

Recall that the forward pass is the process of feeding input data through the network to obtain predictions. In PyTorch, it is often just calling the model as if it were a function, e.g.,


outputs = model(inputs)

When you do that, PyTorch's autograd system records all relevant operations. Next, you typically compute a loss. For instance, if you are working on a classification task, you might use nn.CrossEntropyLoss, which implements the cross-entropy function typically used in multi-class classification. Something like:


criterion = nn.CrossEntropyLoss()
loss = criterion(outputs, labels)

Once you have the loss, you call loss.backward() to trigger the backward pass. This is where PyTorch applies the chain rule to compute the partial derivatives of loss with respect to each parameter. The partial derivatives are stored in the .grad attribute of each parameter (i.e., each weight and bias matrix in the network).

While these steps might seem mechanical, they are the foundation of training in PyTorch. The forward pass is used to make predictions and compute a measure of error, and the backward pass is used to compute updates that will reduce that error during the next iteration. This approach is consistent with the fundamental gradient descent idea, which we have explored in detail in earlier articles of this course.

data handling

dataset and dataloader

Real-world data often does not come in neat $numpy$ arrays or ready-made PyTorch tensors; it may involve images in various folders, text logs, or sensor data streams. PyTorch addresses this challenge through the torch.utils.data package, in particular the Dataset and DataLoader classes.

Dataset: A blueprint for how to access and process your data. It might store references to your files (e.g., image paths), handle reading them from disk, and transform them into tensors.
DataLoader: Wraps a Dataset and provides iteration over the data in mini-batches, typically with shuffling, parallel workers, and so on.

For example, if you want to load the MNIST dataset for digit classification, PyTorch's torchvision library offers torchvision.datasets.MNIST, which handles downloads, transformations, and the getitem logic for you. You would then wrap that in a DataLoader:


from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.ToTensor()

mnist_train = datasets.MNIST(root='data', train=True, download=True, transform=transform)
mnist_val = datasets.MNIST(root='data', train=False, download=True, transform=transform)

train_loader = DataLoader(mnist_train, batch_size=64, shuffle=True)
val_loader = DataLoader(mnist_val, batch_size=64, shuffle=False)

Once this setup is done, you can loop over train_loader in your training loop, retrieving mini-batches of data and labels.

custom datasets

When your data is less standard, you can subclass torch.utils.data.Dataset and implement two key methods: len, which returns the size of your dataset, and getitem, which returns the $i$ -th data sample and label (or target). For example:


from torch.utils.data import Dataset

class MyCustomDataset(Dataset):
    def __init__(self, file_paths, labels, transform=None):
        self.file_paths = file_paths
        self.labels = labels
        self.transform = transform
    
    def __len__(self):
        return len(self.file_paths)
    
    def __getitem__(self, index):
        # load data from file
        data = load_data_file(self.file_paths[index])
        label = self.labels[index]
        if self.transform:
            data = self.transform(data)
        return data, label

Here,


load_data_file

is a hypothetical function that handles reading from disk or from a network location. This approach is flexible enough to manage any domain-specific data structure, from tabular data in CSV files to 3D point clouds.

data transformations and augmentation

Data transformations can be handled in many ways, but torchvision.transforms is one of the more popular modules for computer vision tasks. It provides operations like random cropping, resizing, flipping, etc. The notion of augmentation is crucial in many deep learning applications, especially in computer vision, because it artificially increases the variety of your training data and can help mitigate overfitting.

For example:


import torchvision.transforms as T

transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomResizedCrop(size=224),
    T.ColorJitter(brightness=0.1, contrast=0.1),
    T.ToTensor()
])

In natural language processing or audio tasks, you have analogous transformations: tokenization, random time-shifts, pitch changes, and so on. The concept is the same — systematically modify the original data in ways that preserve the semantic meaning while giving the model more robust training signals.

An image was requested, but the frog was found.

Alt: "data-loading-process-diagram"

Caption: "A conceptual illustration of how data is loaded, transformed, and batched in PyTorch using DataLoader and Dataset."

Error type: missing path

model training

defining the training loop

The next step after constructing your model and data pipeline is to define a training loop. This loop orchestrates the forward pass, loss calculation, backward pass, and parameter updates over multiple epochs. Below is a conceptual skeleton in PyTorch:


model = SimpleFeedForward()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

num_epochs = 10

for epoch in range(num_epochs):
    for batch_idx, (inputs, labels) in enumerate(train_loader):
        
        # 1. Zero the gradient buffers
        optimizer.zero_grad()
        
        # 2. Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        
        # 3. Backward pass (compute grads)
        loss.backward()
        
        # 4. Update parameters
        optimizer.step()

    print(f"Epoch {epoch+1} completed.")

While this code is straightforward, there are many details that can be added or tweaked, including:

Moving inputs and labels to a GPU device if available (i.e., inputs = inputs.cuda()).
Accumulating statistics or metrics like accuracy within each batch to track progress.
Managing learning rate schedules or other hyperparameter modifications through PyTorch's torch.optim.lr_scheduler.

Many advanced training scenarios revolve around this pattern. In large-scale contexts, you might rely on a distributed data parallel strategy to replicate this loop across multiple machines, each working on a subset of the data. Still, the fundamental logic remains the same.

optimizers and loss functions

PyTorch provides a variety of optimizers through torch.optim, from standard Gradient Descent (SGD) and Momentum to Adam (Kingma & Ba, ICLR 2015), RMSProp, Adagrad, etc. Each optimizer has its own set of hyperparameters, typically including a learning rate ( $lr$ ) and momentum terms or $\beta$ parameters, as in Adam. The general formula for updating parameters $w$ via gradient descent is:

w \leftarrow w - \eta \nabla_w L(w)

Where $\eta$ is the learning rate and $\nabla_w L(w)$ is the gradient of the loss with respect to $w$ . More sophisticated optimizers incorporate momentum terms or adaptive learning rates.

Loss functions are found in torch.nn, with cross-entropy, mean squared error, L1 loss, and many others available. In classification tasks, nn.CrossEntropyLoss is standard for multi-class problems, while nn.BCELoss or nn.BCEWithLogitsLoss is used for binary classification. For regression tasks, nn.MSELoss or nn.L1Loss are more common. One can also define custom losses by writing your own function that returns a scalar tensor for which $requires\_grad=True$ .

validation and testing

To ensure that your model is generalizing properly rather than memorizing training data, you should frequently measure validation performance. Often, one keeps aside a validation set (or uses cross-validation, as discussed in earlier articles). The typical pattern for validation in PyTorch might be:


model.eval()  # put model in eval mode (e.g., disables dropout)
val_loss = 0
with torch.no_grad():  # no need to compute gradients
    for inputs, labels in val_loader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        val_loss += loss.item()

val_loss /= len(val_loader)
print(f"Validation Loss: {val_loss}")
model.train()  # back to training mode

In eval mode, certain layers like dropout or batch normalization may behave differently than in training mode, ensuring consistent performance metrics. During actual "testing" or final evaluation, the procedure is analogous, though typically your test set is strictly held-out data from training.

other techniques: transfer learning, distributed training, mixed precision training, saving, loading, etc.

Although the fundamentals of building neural networks with PyTorch involve the steps we have already covered — data loading, model definition, training loop, etc. — it is worth taking some time to acknowledge additional techniques that are essential in modern deep learning:

Transfer Learning: A powerful approach where you take a model pre-trained on a large dataset (e.g., ImageNet for vision tasks) and fine-tune it on your own dataset, typically with fewer training samples. PyTorch's torchvision.models includes well-known architectures such as ResNet or VGG, which can be loaded with pre-trained weights. By freezing early layers or adjusting the final layer to match your custom dataset's number of classes, you can leverage transfer learning to achieve high performance with minimal training from scratch.
Distributed Training: PyTorch's torch.distributed or DataParallel approach allows you to scale to multiple GPUs on a single machine or even multiple machines. This is crucial when training large models or dealing with massive datasets. In the DistributedDataParallel model, each GPU sees a portion of the data, and gradients are synchronized across all workers after each batch. This distributed strategy is commonly used in both research and production contexts, especially for tasks like large-scale language models or vision transformers.
Mixed Precision Training: Also known as half-precision training, or using $float16$ for specific operations to reduce memory usage and speed up computation on modern GPUs that support Tensor Cores. PyTorch provides torch.cuda.amp, which enables automatic mixed precision. The concept behind this is to keep certain computations in higher precision to preserve numerical stability, while other computations can be performed in half precision for efficiency gains. Typically, mixed precision leads to faster training times while requiring less GPU memory, without sacrificing accuracy in most cases.
Saving and Loading Models: It is very common to save model weights periodically during training for checkpointing purposes. In PyTorch, this is done with torch.save for saving and torch.load for loading:
```
# Saving
torch.save(model.state_dict(), "model_weights.pth")

# Loading
model = SimpleFeedForward()
model.load_state_dict(torch.load("model_weights.pth"))
```
This approach (using state_dict) is recommended over trying to save the entire nn.Module object directly. The state_dict only contains the model parameters and buffers, which is typically enough to reconstruct the model if you have the same code for your architecture.
Model Checkpointing and Early Stopping: In real projects, you often do not want to wait until the end of training to see if the model overfits. Instead, you can regularly evaluate your validation set performance and store the best-performing weights. Early stopping is used when you notice the validation performance ceases to improve or starts to degrade, at which point continuing training might lead to overfitting or wasted compute resources.
Integration with Ecosystem Tools: PyTorch works seamlessly with many other advanced libraries and frameworks. Examples include:
- PyTorch Lightning or fastai: Provide higher-level abstractions on top of pure PyTorch, simplifying or automating boilerplate code for training loops, logging, or checkpointing.
- TensorBoard or Weights & Biases: Tools for logging training metrics and visualizing them in real-time.
- ONNX: Open Neural Network Exchange for exporting models built in PyTorch to other runtimes.
Advanced Topics: If you venture even further, you may encounter JIT compilation via torch.jit, quantization for edge deployment, or HPC-oriented solutions using NVIDIA's APEX. Additionally, large language models (LLMs) use sophisticated distributed strategies like sharded gradients or model parallel approaches, all of which are available to some extent in PyTorch or third-party libraries built on top of it.

An image was requested, but the frog was found.

Alt: "distributed-training-diagram"

Caption: "Schematic depiction of distributed data parallel training with multiple GPUs."

Error type: missing path

All these capabilities make PyTorch not just a friendly tool for building your first neural network but also a robust, production-ready solution for the entire cycle of designing, training, and deploying advanced machine learning models. Indeed, the library is used in wide-ranging domains from classical image classification tasks to generative adversarial networks (GANs), from reinforcement learning to complex multi-modal architectures.

I encourage you to set up a small project using PyTorch. Download or generate a dataset, define a custom Dataset, build a small feed-forward network, pick an optimizer, and train. Experiment with different hyperparameters — for instance, changing the learning rate or trying out different optimizers — to see how it influences your results. From there, you can move on to more sophisticated architectures, incorporate advanced data augmentations, or even try distributed training if you have the resources.

PyTorch is a vast ecosystem, and what we covered here represents only the core fundamentals that every data scientist or machine learning engineer should know. As you progress, you may find yourself diving deeper into specialized tasks like computer vision (with torchvision), NLP (with torchtext or Hugging Face Transformers), or audio (with torchaudio). Each domain has unique challenges, but PyTorch's consistent API and dynamic graph approach make the learning curve more manageable.

If you keep building upon the fundamental knowledge from this article, you will be well-prepared to tackle the more advanced topics in neural network design, distributed strategies, or domain-specific applications that are introduced in subsequent articles of this course. Above all, remember that PyTorch's design philosophy is about flexibility, immediacy, and readability — that is, you should feel empowered to experiment quickly, debug in real time, and adapt your code to novel research ideas or production constraints without wrestling with a difficult workflow.

That concludes the overview and deep dive on PyTorch, covering everything from the basics of dynamic computation graphs and tensor-based data loading to advanced references on distributed training and model saving/loading. I look forward to seeing how you combine these principles and techniques in your future machine learning ventures!

Averett's Heuristics@avheuristics

Subscribe to my Telegram channel for updates in the Research section and more tech content