PyTorch Fundamental Concepts
Tensor Definition and Example in PyTorch
A tensor is a multi-dimensional array used to represent data in machine learning. It generalizes scalars (0D), vectors (1D), and matrices (2D) to n dimensions, enabling the storage and processing of complex data structures such as time-series, sequences, or images.
In PyTorch, tensors are the fundamental data structure used for input data, model parameters, and gradients. They support GPU acceleration and automatic differentiation.
Example: Stock Price Prediction
Suppose you're building a model to predict stock prices using the past 10 days of closing prices for each stock. For a batch of 3 stocks, where each entry contains 10 days of prices (1 feature: closing price), the data shape will be:
[batch_size, sequence_length, feature_dim] → [3, 10, 1]
PyTorch Code:
import torch
# Historical prices for 3 stocks, each with 10 days of data
stock_price_data = [
    [[101.2], [102.5], [103.0], [104.2], [105.1], [106.0], [107.3], [108.5], [109.0], [110.1]],
    [[210.0], [211.2], [212.3], [213.4], [214.1], [215.0], [215.9], [216.8], [217.5], [218.0]],
    [[310.5], [309.2], [308.8], [307.1], [306.5], [305.2], [304.0], [303.8], [303.0], [302.1]]
]
# Convert to PyTorch tensor
tensor_data = torch.tensor(stock_price_data, dtype=torch.float32)
print(tensor_data.shape) # Output: torch.Size([3, 10, 1])
print(tensor_data)
This tensor can now be used as input to a sequence model such as torch.nn.LSTM or torch.nn.Transformer constructed with batch_first=True, which then expects a 3D input tensor in the format:
[batch_size, sequence_length, feature_dim]
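For example, here is a minimal sketch of feeding tensor_data into an LSTM; the hidden size of 8 is an arbitrary illustrative choice, and batch_first=True matches the [batch, seq, feature] layout:
import torch.nn as nn
lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)  # hidden_size chosen for illustration
output, (h_n, c_n) = lstm(tensor_data)
print(output.shape)  # torch.Size([3, 10, 8]) - one hidden state per time step
print(h_n.shape)     # torch.Size([1, 3, 8]) - final hidden state per stock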
Tensor Dimensions and Shape
Dimension (or Rank)
The dimension of a tensor (also called its rank) refers to the number of axes or indices required to access an element. It answers the question:
How many nested levels of data are there?
| Example | Tensor | Dimension |
| --- | --- | --- |
| Scalar | 5 | 0D |
| Vector | [5.1, 6.2, 7.3] | 1D |
| Matrix | [[1, 2], [3, 4]] | 2D |
| 3D Tensor | [[[1], [2]], [[3], [4]]] | 3D |
Shape
The shape of a tensor is a tuple that describes how many elements exist along each dimension (axis). It answers:
How big is each axis?
| Tensor Example | Shape |
| --- | --- |
| [1.2, 3.4, 5.6] | (3,) |
| [[1, 2], [3, 4]] | (2, 2) |
| A batch of 3 sequences, 10 time steps, 1 feature | (3, 10, 1) |
PyTorch Example
import torch
tensor = torch.randn(3, 10, 1)
print(tensor.ndim) # Output: 3 (dimension/rank)
print(tensor.shape) # Output: torch.Size([3, 10, 1])
This example creates a 3D tensor representing 3 samples, each with 10 time steps and 1 feature (e.g., for stock price prediction).
Summary
| Term | Definition | Example |
| --- | --- | --- |
| Dimension | Number of axes (rank) | 3 for shape (3, 10, 1) |
| Shape | Size along each axis | (3, 10, 1) |
Rules for Matrix Multiplication
In linear algebra and in PyTorch, two matrices A and B can be multiplied using matrix multiplication if:
- The number of columns in A equals the number of rows in B.
If A has shape (m, n) and B has shape (n, p), then the result C = A × B has shape (m, p).
Valid Example
import torch
A = torch.randn(2, 3) # shape: (2 rows, 3 columns)
B = torch.randn(3, 4) # shape: (3 rows, 4 columns)
C = torch.matmul(A, B) # result shape: (2, 4)
print(C.shape) # Output: torch.Size([2, 4])
Invalid Example (Will Raise Error)
A = torch.randn(2, 3)
B = torch.randn(2, 4)
# This will raise a runtime error:
C = torch.matmul(A, B) # incompatible shapes: (2,3) x (2,4)
Error Reason: The number of columns in A (3) does not match the number of rows in B (2).
Batch Matrix Multiplication
PyTorch also supports batched matrix multiplication using tensors with more than 2 dimensions.
A = torch.randn(10, 2, 3) # batch of 10 matrices of shape (2, 3)
B = torch.randn(10, 3, 4) # batch of 10 matrices of shape (3, 4)
C = torch.matmul(A, B) # output: (10, 2, 4)
print(C.shape)
Element-wise Multiplication (Not Matrix Multiplication)
If you want to multiply element by element (Hadamard product), use * instead:
A = torch.randn(2, 3)
B = torch.randn(2, 3)
C = A * B # element-wise multiplication
torch.nn.Linear in PyTorch
torch.nn.Linear() is a PyTorch module that applies a linear transformation to the incoming data:
Y = XWᵀ + b
This is equivalent to a fully connected (dense) layer in a neural network.
Function Signature
torch.nn.Linear(in_features, out_features, bias=True)
- in_features: Size of each input sample (number of input features)
- out_features: Size of each output sample (number of neurons)
- bias (optional): If True, includes a learnable bias term
Example
import torch
import torch.nn as nn
# Create a Linear layer: 3 input features → 2 output features
linear = nn.Linear(in_features=3, out_features=2)
# Sample input: batch of 4 samples, each with 3 features
x = torch.randn(4, 3)
# Apply the linear transformation
output = linear(x)
print(output.shape) # Output: torch.Size([4, 2])
What It Does Internally
The nn.Linear module initializes:
- A weight matrix of shape (out_features, in_features)
- An optional bias vector of shape (out_features)
Then it computes: output = x @ weight.T + bias
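A quick check of this equivalence, reusing the linear layer and x from the example above:
manual = x @ linear.weight.T + linear.bias   # same computation done by hand
print(torch.allclose(manual, linear(x)))     # True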
Accessing Parameters
print(linear.weight.shape) # torch.Size([2, 3])
print(linear.bias.shape) # torch.Size([2])
Common Use in Models
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)
Here, a linear layer maps an input of 10 features to a single output (e.g., a regression prediction).
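A brief usage sketch for this model (the batch size of 4 is an arbitrary illustration):
model = SimpleNN()
x = torch.randn(4, 10)   # batch of 4 samples, 10 features each
y_pred = model(x)
print(y_pred.shape)      # torch.Size([4, 1])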
PyTorch Tensor Manipulation Methods
These methods are commonly used when reshaping or reordering tensors to make them compatible with model inputs, layers, or batch-processing operations.
torch.reshape(input, shape)
Reshapes a tensor to the specified shape, if the number of elements matches. The returned tensor has the same data but arranged differently.
x = torch.randn(2, 3)
y = torch.reshape(x, (3, 2))
Why use it? To prepare data for specific layers or operations. For example, flattening image data from (batch, channels, height, width) to (batch, -1) before feeding it into a linear layer.
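A small sketch of that flattening pattern; the image dimensions below are illustrative:
images = torch.randn(16, 3, 28, 28)                 # (batch, channels, height, width)
flat = torch.reshape(images, (images.shape[0], -1)) # flatten everything except the batch axis
print(flat.shape)                                   # torch.Size([16, 2352]) since 3*28*28 = 2352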
Tensor.view(shape)
Similar to reshape, but view always returns a view of the original tensor (no copy is made). It only works when the tensor's memory layout is contiguous; otherwise it raises an error, and you must call .contiguous() first or use reshape instead.
x = torch.randn(2, 3)
y = x.view(3, 2)
Why use it? For memory efficiency when reshaping without creating a new tensor. Common in model forward passes and feature flattening.
torch.stack(tensors, dim=0)
Stacks a sequence of tensors along a new dimension. All tensors must be the same shape.
a = torch.tensor([1, 2])
b = torch.tensor([3, 4])
c = torch.stack([a, b], dim=0) # shape: (2, 2)
Why use it? To batch multiple tensors into a single tensor for parallel processing (e.g., stacking multiple samples for batch inference).
torch.squeeze(input)
Removes all dimensions with size 1 from the tensor.
x = torch.randn(1, 3, 1, 4)
y = torch.squeeze(x) # shape: (3, 4)
Why use it? To clean up unnecessary dimensions after operations like broadcasting or loading single-sample data (e.g., remove singleton batch or channel dimensions).
torch.unsqueeze(input, dim)
Inserts a new dimension of size 1 at the specified dim.
x = torch.tensor([1, 2, 3]) # shape: (3,)
y = torch.unsqueeze(x, 0) # shape: (1, 3)
Why use it? To add batch or channel dimensions before feeding into models. Useful for aligning tensor shapes for broadcasting or convolutional layers.
torch.permute(input, dims)
Rearranges the dimensions of the tensor according to dims.
x = torch.randn(2, 3, 4) # shape: (batch, channels, time)
y = torch.permute(x, (0, 2, 1)) # shape: (batch, time, channels)
Why use it? To reorder axes for compatibility with layers or operations (e.g., when converting between (batch, channels, height, width) and (batch, height, width, channels)). Essential in attention models and image processing.
Converting Between NumPy and PyTorch
NumPy ↔ PyTorch Conversion Methods
PyTorch and NumPy are both popular libraries for numerical computing, and it’s often necessary to convert data between them. Below are the two primary methods you’ll use:
torch.from_numpy(ndarray)
Converts a NumPy array into a PyTorch tensor. The returned tensor and the original NumPy array will share the same memory, so modifying one affects the other.
import numpy as np
import torch
np_array = np.array([[1.0, 2.0], [3.0, 4.0]])
torch_tensor = torch.from_numpy(np_array)
print(torch_tensor)
# tensor([[1., 2.],
# [3., 4.]], dtype=torch.float64)
Why use it? When you're preprocessing data or loading datasets with NumPy, but need to work with PyTorch models. This method is efficient since no data is copied.
Note: The NumPy array must use a data type PyTorch supports (e.g., float32, float64, int64). Arrays with unsupported dtypes, such as object or string arrays, will raise an error.
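A small demonstration of the shared memory, reusing the imports above (the values are arbitrary):
np_array = np.array([1.0, 2.0, 3.0])
torch_tensor = torch.from_numpy(np_array)
np_array[0] = 100.0          # modify the NumPy array in place
print(torch_tensor)          # tensor([100., 2., 3.], dtype=torch.float64) - the change is visible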
torch_tensor.numpy()
Converts a PyTorch tensor into a NumPy array. Like the previous method, this returns a shared-memory view — changing one will change the other if the tensor is on the CPU.
tensor = torch.tensor([[10.0, 20.0], [30.0, 40.0]])
np_array = tensor.numpy()
print(np_array)
# array([[10., 20.],
# [30., 40.]], dtype=float32)
Why use it? When you're done with PyTorch computations and want to analyze or visualize the data using NumPy-based libraries like matplotlib or pandas.
Important: You can only call .numpy() on tensors that reside on the CPU. If the tensor is on a GPU, you'll need to move it first:
tensor = tensor.cuda() # move to GPU
tensor.numpy() # ❌ will raise error
tensor.cpu().numpy() # ✅ correct way
Reproducibility in Neural Networks and Deep Learning
Reproducibility is a critical concept in deep learning and neural network research.
It refers to the ability to obtain the same results when an experiment is repeated under the same conditions.
This includes training the same model architecture on the same dataset with identical hyperparameters, random seeds, and hardware environment.
In deep learning, reproducibility is often more challenging than in traditional software because of the inherent randomness in model training. Common sources of non-determinism include:
- Random weight initialization
- Shuffling of data during training
- Parallel computation (e.g., GPU kernels executing in non-deterministic order)
- Floating point precision and hardware-level operations
To promote reproducibility, deep learning practitioners typically set random seeds for all relevant libraries. For example, in PyTorch:
import torch
import numpy as np
import random
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
These steps minimize randomness but can reduce training performance due to disabled optimizations. Therefore, there's often a tradeoff between reproducibility and speed.
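As a quick sanity check, resetting the seed reproduces the same random values within a single process (assuming the same PyTorch version and hardware):
torch.manual_seed(42)
a = torch.randn(2, 2)
torch.manual_seed(42)
b = torch.randn(2, 2)
print(torch.equal(a, b))   # True - same seed, same generator state, same values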
Reproducibility also includes documenting and version-controlling all aspects of an experiment: code, data, environment, and configuration. Tools like MLflow, Weights & Biases, and DVC (Data Version Control) help track experiments and make them repeatable.
In collaborative or production environments, reproducibility ensures that models can be validated, audited, and improved systematically. For example, a machine learning model used in finance or healthcare must be reproducible to satisfy regulatory requirements.
Detailed Explanation of Key PyTorch Components for Creating Models
When building neural networks in PyTorch, it's important to understand the core modules and their responsibilities. Below is a detailed breakdown of key elements commonly used in PyTorch model design:
1. torch.nn
What it does: This is PyTorch’s core neural network library. It provides all of the building blocks necessary for constructing computational graphs, which are sequences of operations that define a model.
It includes layers like Linear, Conv2d, and RNN, and functions like ReLU, Dropout, and BatchNorm. These components can be composed to define the structure and forward flow of a neural network.
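For example, a minimal sketch composing a few of these building blocks with nn.Sequential; the layer sizes are arbitrary:
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),   # linear layer: 10 features in, 32 out
    nn.ReLU(),           # non-linear activation
    nn.Dropout(p=0.2),   # randomly zeroes activations during training
    nn.Linear(32, 1),    # final linear layer producing one output
)
print(model(torch.randn(4, 10)).shape)   # torch.Size([4, 1])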
2. torch.nn.Parameter
What it does: A wrapper for tensors that tells PyTorch to automatically register them as trainable parameters when used inside an nn.Module.
If requires_grad=True, PyTorch will track operations on the tensor and compute gradients during backpropagation. This is critical for autograd (automatic differentiation), allowing weights to be updated during training.
Example:
self.weight = nn.Parameter(torch.randn(10, 5))
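A small sketch showing the effect of nn.Parameter inside a module; the ScaleLayer class below is a hypothetical example, not a PyTorch built-in:
import torch
import torch.nn as nn

class ScaleLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))   # registered as a trainable parameter
        self.offset = torch.zeros(1)               # plain tensor: NOT registered

    def forward(self, x):
        return x * self.scale + self.offset

layer = ScaleLayer()
print([name for name, _ in layer.named_parameters()])   # ['scale']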
3. torch.nn.Module
What it does: This is the base class for all neural networks in PyTorch. When you define a custom model, you create a subclass of nn.Module.
It handles:
- Registration of parameters and sub-modules
- Device management (e.g., moving to GPU)
- Serialization (saving/loading models)
Every subclass must implement a forward() method, which defines the computation that the model performs.
Example:
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)
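A brief usage sketch for MyModel, assuming torch and nn are imported as earlier (the batch size of 4 is arbitrary):
model = MyModel()
x = torch.randn(4, 10)    # batch of 4 samples with 10 features
print(model(x).shape)     # torch.Size([4, 1])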
4. torch.optim
What it does: This module provides a collection of optimization algorithms (e.g., SGD, Adam, RMSprop) that control how the model's parameters are updated based on gradients computed from the loss function.
You pass the model's parameters (usually via model.parameters()) to an optimizer and call optimizer.step() to apply the gradient update.
Example:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
optimizer.zero_grad()   # clear gradients left over from the previous step
loss.backward()         # compute gradients of the loss w.r.t. the parameters
optimizer.step()        # apply the gradient update to the parameters
5. def forward()
What it does: Every nn.Module subclass must define this method. It specifies the computation performed by the model when data is passed through it.
Think of it as the pipeline of operations that the input data goes through, for example linear transformations, activations, and normalizations.
Example:
def forward(self, x):
    x = self.linear1(x)
    x = F.relu(x)        # F is torch.nn.functional
    x = self.linear2(x)
    return x
Detailed Explanation of the PyTorch Training Loop
A training loop in PyTorch consists of several sequential steps that allow a model to learn from data. Each step serves a specific role in computing predictions, evaluating errors, and updating model parameters. Below is a more detailed breakdown of the five core steps in the training process:
1. Forward Pass
What it does: The model receives a batch of input data and processes it through its forward() method. Each layer applies its transformation (e.g., matrix multiplication, activation functions), and the final output is produced.
Code: y_pred = model(x_train)
Purpose: This step generates predictions based on the current model parameters. No weights are updated here—this is just inference with the current state of the model.
2. Calculate the Loss
What it does: Compares the model’s predictions to the ground truth labels using a loss function (e.g., Mean Squared Error, Cross Entropy). This gives a scalar value representing how far off the predictions are.
Code: loss = loss_fn(y_pred, y_train)
Purpose: Quantifies the model’s error. This value is critical for guiding how the model should update its parameters to improve predictions.
3. Zero Gradients
What it does: Clears old gradients from the previous training step. In PyTorch, gradients accumulate by default, so this step is necessary to avoid mixing gradients from different steps.
Code: optimizer.zero_grad()
Purpose: Ensures that gradient calculations are isolated to the current training batch. Without this, gradients from previous steps would corrupt the update.
4. Perform Backpropagation on the Loss
What it does: Computes the gradient of the loss with respect to all parameters in the model that have requires_grad=True. PyTorch’s autograd engine traces the operations from the forward pass and applies the chain rule backward.
Code: loss.backward()
Purpose: Produces gradients that tell the model how to change each parameter to reduce the loss. This step sets up the optimizer for the actual parameter update.
5. Update the Optimizer (Gradient Descent)
What it does: Applies the gradients to the model's parameters. This is where the learning happens — each parameter is nudged in the direction that reduces the loss.
Code: optimizer.step()
Purpose: Updates the model weights using the gradients computed during backpropagation. The learning rate and optimization algorithm (e.g., Adam, SGD) define how big the update step is.
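Putting the five steps together, here is a minimal sketch of a training loop. The synthetic data, the stand-in nn.Linear model, and the hyperparameters are illustrative assumptions, not prescriptions:
import torch
import torch.nn as nn

# Hypothetical regression data: 100 samples, 3 features each
x_train = torch.randn(100, 3)
y_train = torch.randn(100, 1)

model = nn.Linear(3, 1)                    # simple stand-in model
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    y_pred = model(x_train)                # 1. forward pass
    loss = loss_fn(y_pred, y_train)        # 2. calculate the loss
    optimizer.zero_grad()                  # 3. zero gradients
    loss.backward()                        # 4. backpropagation
    optimizer.step()                       # 5. update the parameters
    print(f"epoch {epoch}: loss = {loss.item():.4f}")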
torch.inference_mode()
What is it?
torch.inference_mode() is a context manager and decorator introduced in PyTorch 1.9 designed to optimize inference (model evaluation) by disabling autograd and additional internal tracking for better speed and memory efficiency.
Purpose
- Disable gradient computation: Prevents building the computation graph, so no gradients are tracked or computed.
- Disable version counter updates: Unlike torch.no_grad(), it also disables tracking of tensor version counters, reducing overhead.
- Improve inference performance: Cuts down on autograd and internal bookkeeping to speed up forward passes and reduce memory usage during inference.
Why use it?
- When running a trained model for inference only (no training, no backpropagation).
- To reduce memory footprint.
- To gain faster inference speed compared to torch.no_grad().
How to use it?
with torch.inference_mode():
    output = model(input_tensor)
Or as a function decorator:
@torch.inference_mode()
def infer(model, input_tensor):
    return model(input_tensor)
Under the hood
inference_mode() sets an internal flag that disables autograd and avoids tracking version counters. This can lead to memory savings, especially for complex models or large inputs.
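A quick check that autograd is indeed disabled inside the context; the small model here is an arbitrary illustration:
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
with torch.inference_mode():
    out = model(torch.randn(2, 3))
print(out.requires_grad)   # False - no computation graph was built for this output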