Tensors are generalizations of vectors and matrices to higher dimensions. A tensor is a mathematical object that can be represented as a multi-dimensional array of numbers and transforms:
The rank of a tensor refers to the number of indices (dimensions) required to uniquely identify each element of the tensor. E.g., rank-3 tensors \(T_{ijk}\).
Notation:
\(T_{i_1 i_2 \ldots i_n}\)
Tensor contraction: Reducing the rank of a tensor by summing over one or more pairs of indices (e.g., setting \(j=i\) in \(T_{ij}\) and summing over \(i\)).
Consider a rank-\(2\) tensor, represented as a matrix \(A\) with components \(A_{ij}\). The contraction involves summing over one pair of indices, here by setting \(i=j\).
\( A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \)
The contraction over \(i\) and \(j\) (i.e., \(\sum_i A_{ii}\)) results in the trace of the matrix:
\( \text{Tr}(A) = \sum_i A_{ii} = a_{11} + a_{22} \)
For a rank-3 tensor \(T_{ijk}\), we can contract over any pair of indices. For example, contracting over \(i\) and \(j\):
\( \sum_{i} T_{iik} \)
This reduces \(T_{ijk}\), which has rank \(3\), to a tensor \(T_k\), which has rank \(1\) (a vector).
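The two contractions above can be checked numerically. This is a minimal sketch using `numpy.einsum`; the values in `T` are illustrative assumptions, not taken from the notes.

```python
import numpy as np

# Rank-2 case: contracting the two indices of A gives the trace.
A = np.array([[1, 2], [3, 4]])
trace = np.einsum("ii->", A)        # sum_i A_ii

# Rank-3 case: contracting i and j of T_ijk leaves a rank-1 tensor T_k.
T = np.arange(8).reshape(2, 2, 2)   # illustrative values 0..7
T_k = np.einsum("iik->k", T)        # sum_i T_iik

print(trace)   # 5
print(T_k)     # [6 8]
```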
Element-wise multiplication: given \(A=[[1,2],[3,4]]\) and \(B=[[5,6],[7,8]]\):
\(C=multiply(A,B)=[[1 \cdot 5,2 \cdot 6],[3 \cdot 7,4 \cdot 8]]=[[5,12],[21,32]]\)
In Python:
numpy.multiply(A,B)
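As a runnable version of the call above, the following sketch builds the two matrices and also computes the matrix product for contrast, since the two operations are easy to confuse.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = np.multiply(A, B)   # element-wise product, same as A * B
M = A @ B               # matrix product, shown only for contrast

print(C)   # [[ 5 12] [21 32]]
print(M)   # [[19 22] [43 50]]
```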
A perceptron is a single-layer neural network. It is the simplest type of artificial neural network and is a foundational building block in machine learning.
Mathematical Representation of the perceptron:
\(y=f(\sum_{i=1}^n w_i x_i + b)\)
A typical activation function is the step function: \(f(x)=1\) if \(x>0\) and \(f(x)=0\) otherwise.
The perceptron doesn't have the capacity for complex patterns or nonlinear decision boundaries. An MLP with sufficient hidden neurons can approximate any continuous function (on a bounded input range) to arbitrary accuracy.
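A minimal sketch of the perceptron formula with a step activation. The weights and bias are hand-picked assumptions that make the unit behave like a logical AND; they are not learned here.

```python
import numpy as np

def step(x):
    # Step activation: 1 for positive input, 0 otherwise.
    return np.where(x > 0, 1, 0)

def perceptron(x, w, b):
    # y = f(sum_i w_i x_i + b)
    return step(np.dot(x, w) + b)

# Hand-picked (assumed) parameters implementing logical AND.
w = np.array([1.0, 1.0])
b = -1.5

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
outputs = perceptron(inputs, w, b)
print(outputs)   # [0 0 0 1]
```

AND is linearly separable, so a single perceptron can represent it; XOR is the classic case it cannot.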
A multi-layer perceptron (MLP) is an extension of the perceptron that:
Hidden layers are an architectural element of an MLP consisting of one or more intermediate layers that process the inputs.
Networks that have multiple hidden layers are called deep neural networks.
Mathematical representation for a single hidden layer:
\(z_j=f(\sum_{i=1}^{n}w_{ji}x_i+b_j)\)
\(y_k=g(\sum_{j=1}^{m}v_{kj}z_j+c_k)\)
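The two equations for a single hidden layer can be sketched as a forward pass. ReLU for \(f\), the identity for \(g\), and all weight values are assumptions chosen so the result is easy to verify by hand.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Assumed illustrative parameters: n=2 inputs, m=2 hidden units, 1 output.
W = np.array([[1.0, -1.0],
              [0.5,  0.5]])   # w_ji
b = np.array([0.0, 0.0])      # b_j
V = np.array([[2.0, 1.0]])    # v_kj
c = np.array([0.5])           # c_k

x = np.array([1.0, 2.0])

z = relu(W @ x + b)   # hidden layer: z_j = f(sum_i w_ji x_i + b_j)
y = V @ z + c         # output layer with identity g

print(z)   # [0.  1.5]
print(y)   # [2.]
```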
Two hidden layers example: \(z=f(\text{input})\)
\(t=g(z)\)
\(\text{output}=h(t)\)
Applications:


| Transfer function | Formula | Output range | Usage | Limitations |
|---|---|---|---|---|
| Linear | \(f(x)=x\) | | Output layers for regression tasks | Cannot model non-linear relationships |
| Sigmoid | \(f(x)=\frac{1}{1+e^{-x}}\) | \((0,1)\) | Probabilities in binary classification tasks | Not zero-centered (slows down convergence). Vanishing gradient: outputs are in the range \((0,1)\) and derivatives are very small for inputs far from 0 |
| Tanh | \(f(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}\) | \((-1,1)\) | Zero-centered (speeds up convergence) | Vanishing gradient: outputs are in the range \((-1,1)\) and derivatives are very small for large positive or negative inputs. Used occasionally but often replaced by ReLU |
| ReLU (Rectified Linear Unit) | \(f(x)=\max(0,x)\) | | Computationally efficient. Mitigates the vanishing gradient for positive inputs. Default choice for most hidden layers (including its variants) | Can cause dead neurons (zero gradients) if weights are initialized poorly |
| Leaky ReLU | \(f(x)=x\) if \(x>0\), \(f(x)=\alpha x\) if \(x \le 0\), with small \(\alpha>0\) | | Addresses the dead neuron problem by allowing small gradients for negative inputs | |
| Softmax | \(f(x_i)=\frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}\) where \(i\) is the class | | Used in output layers for multi-class classification | |
| SELU (Scaled Exponential Linear Unit) | | | Designed for self-normalizing networks. Maintains mean and variance during training | |
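
Several of the formulas in the table can be written directly in NumPy. This is a sketch for experimentation; the default `alpha=0.01` for Leaky ReLU is an assumed value, and the max-subtraction in `softmax` is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):   # alpha is an assumed default
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    exps = np.exp(x - np.max(x))   # shift for numerical stability
    return exps / np.sum(exps)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x))
print(np.tanh(x))
print(relu(x))
print(leaky_relu(x))
print(softmax(x))
```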
Weight initialization is important in deep learning to:
Weight initializations:
Choosing the right weight initialization and transfer function is critical for effective neural network training.
Gradient descent is an optimization algorithm used to minimize a loss function (i.e., a measure of the difference between predicted and actual values).
Steps:
\(\theta \leftarrow \theta - \lambda \cdot \nabla J(\theta)\)
Then repeat the process until convergence (loss function reaches a minimum).
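The update rule can be tried on a toy problem. This sketch assumes the loss \(J(\theta)=(\theta-3)^2\), a learning rate of 0.1 and 100 iterations; all three are illustrative choices.

```python
def grad_J(theta):
    # Gradient of the assumed loss J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

theta = 0.0           # initial guess
learning_rate = 0.1   # lambda in the update rule

for _ in range(100):
    theta = theta - learning_rate * grad_J(theta)

print(theta)   # very close to the minimizer 3.0
```

Each step shrinks the distance to the minimum by a factor of 0.8 here; a learning rate above 1.0 would make this particular loss diverge.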
Variants of gradient descent:
Backpropagation is the learning algorithm used by MLPs to adjust weights using the error gradient.
Steps:
Key Concepts:
\(W \leftarrow W - \lambda \frac{\partial L}{\partial W}\)
\(b \leftarrow b - \lambda \frac{\partial L}{\partial b}\)
This process trains the network by iteratively reducing the loss.
The vanishing gradient problem happens more often in networks with many layers.
ReLU helps prevent the vanishing gradient problem for positive inputs. However, ReLU neurons can stop learning if their inputs are always negative.
Example of a vanishing gradient in a 3-layer neural network with the sigmoid activation function:
\(Input \rightarrow Hidden layer 1 \rightarrow Hidden layer 2 \rightarrow Output\)
Derivative of the sigmoid (calculated in the Machine Learning sheet): \(\frac{d\phi(z)}{dz}=\phi(z)(1-\phi(z))\)
This derivative (gradient) takes values in the range \((0, 0.25]\). Its maximum is reached at \(z=0\), where \(\phi(z)=0.5\): \(0.5 \cdot (1-0.5)=0.5 \cdot 0.5=0.25\)
Assuming that the network is performing backpropagation to update the weights, the gradient of the loss \(L\) with respect to a weight \(W\) in an early layer is, by the chain rule, a product of the gradients of the layers between it and the output: \(\frac{\partial L}{\partial W}=\) gradient of layer 2 \(\cdot\) gradient of layer 3 \(\cdot\) ... \(\cdot\) gradient of layer n.
Taking the example of the gradient above, the gradients (derivatives of the sigmoid) at each layer would be:
Using the chain rule, the gradient calculated at Layer 1:
\(0.25 \cdot 0.25 \cdot 0.25 = 0.015625 \approx 0.016\)
In deeper networks (e.g. 10+ layers), multiplying many small gradients (values \(<1\)) leads to a gradient approaching \(0\).
When the gradient approaches \(0\), the weights in the earlier layers stop updating effectively during backpropagation. The network cannot learn effectively because weight updates (proportional to the gradient) become negligible.
The ReLU activation function avoids this for positive inputs because, if \(x>0\): \(\frac{d\,\max(0,x)}{dx}=\frac{dx}{dx}=1\)
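The shrinking product can be computed directly. This sketch uses the best case for the sigmoid (derivative 0.25 at every layer) and ignores the weight factors that also appear in the chain rule, so it is a simplification.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

largest = sigmoid_grad(0.0)   # 0.25, the maximum of the sigmoid derivative
relu_grad = 1.0               # ReLU derivative for positive inputs

for n_layers in (3, 10, 20):
    print(n_layers, largest ** n_layers, relu_grad ** n_layers)
```

With 3 layers the sigmoid product is 0.015625; with 10 layers it is already below \(10^{-5}\), while the ReLU product stays at 1.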
Feed-forward networks are neural networks consisting of layers of neurons where information flows in one direction (from the input layer, through hidden layers, to the output layer) and there are no cycles or loops. The main advantage is that they are easy to implement and understand.
Limitations:
Applications:
Convolutional networks are a specialized class of artificial neural networks designed to process and analyze data with a grid-like topology, such as images.
The purpose of convolutional networks is to detect local patterns in the input, like edges and textures.
The ReLU activation function is commonly used in the hidden layers of convolutional networks. It helps the network learn non-linear relationships between the features in the image.
A convolutional layer applies a set of filters (kernels) to the input, producing feature maps. Each filter slides across the input (using a defined stride), computing the dot product between the filter and overlapping regions of the input.
There are 3 types of layers of the convolutional neural networks:
Conventionally, the first convolutional layer is responsible for capturing low-level features such as edges, color and gradient orientation.
\((f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t-\tau)d\tau\)
Given a dose plan of [3 (day 1), 2 (day 2), 1 (day 3)] and a list of new patients of [1 starting on day 1, 2 starting on day 2, 3.., 4.., 5..], the following steps describe the doses needed each day from day \(1\) to day \(5+(3-1)=7\).
Remember that the signal (patient) list is flipped before sliding it along the program (doses).
Day 1
\([\_ \hspace{0.3cm} \_ \hspace{0.3cm} \_ \hspace{0.3cm} \_ \hspace{0.3cm} 3]\)
\([\cdot \hspace{0.3cm} \cdot \hspace{0.3cm} \cdot \hspace{0.3cm} \cdot \hspace{0.3cm} \cdot]\)
\([5 \hspace{0.3cm} 4 \hspace{0.3cm} 3 \hspace{0.3cm} 2 \hspace{0.3cm} 1]\)
\([= \hspace{0.2cm} = \hspace{0.2cm} = \hspace{0.2cm} = \hspace{0.2cm} =]\)
\([\_ \hspace{0.3cm} \_ \hspace{0.3cm} \_ \hspace{0.3cm} \_ \hspace{0.3cm} 3]\)
\(=3\)
Day 2
\([\_ \hspace{0.3cm} \_ \hspace{0.3cm} \_ \hspace{0.3cm} 3 \hspace{0.3cm} 2]\)
\([\cdot \hspace{0.3cm} \cdot \hspace{0.3cm} \cdot \hspace{0.3cm} \cdot \hspace{0.3cm} \cdot]\)
\([5 \hspace{0.3cm} 4 \hspace{0.3cm} 3 \hspace{0.3cm} 2 \hspace{0.3cm} 1]\)
\([= \hspace{0.2cm} = \hspace{0.2cm} = \hspace{0.2cm} = \hspace{0.2cm} =]\)
\([\_ \hspace{0.3cm} \_ \hspace{0.3cm} \_ \hspace{0.3cm} 6 \hspace{0.3cm} 2]\)
\(=8\)
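The same day-by-day totals can be obtained with a discrete convolution. This sketch assumes the patient list continues as 3, 4 and 5 new patients on days 3 to 5, as the example suggests.

```python
import numpy as np

doses = np.array([3, 2, 1])               # dose plan for treatment days 1..3
new_patients = np.array([1, 2, 3, 4, 5])  # new patients starting on days 1..5

# Full discrete convolution: one entry per calendar day, 5 + (3 - 1) = 7 days.
daily_doses = np.convolve(new_patients, doses)
print(daily_doses)   # [ 3  8 14 20 26 14  5]
```

The first two entries match the Day 1 and Day 2 results worked out above.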
Elements:
With valid padding, the convolution is performed without padding: applying the 3x3x1 kernel to a 5x5x1 image (stride 1) produces a 3x3x1 matrix, which here has the same dimensions as the kernel.
With same padding, we augment the 5x5x1 image into a 7x7x1 image (one pixel of padding on each side) and then apply the 3x3x1 kernel over it; the convolved matrix then has dimensions 5x5x1, the same as the input.
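The output sizes for both padding modes follow from the standard formula \(\lfloor (n + 2p - k)/s \rfloor + 1\). The helper name `conv_output_size` below is an illustrative assumption.

```python
def conv_output_size(n, k, padding=0, stride=1):
    # n: input width/height, k: kernel size, padding: pixels added per side.
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(5, 3))              # valid padding -> 3
print(conv_output_size(5, 3, padding=1))   # same padding  -> 5
```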
The pooling layer applies an aggregation operation to reduce the dimensions of the feature map (convolved matrix).
The benefits of reducing dimension of the feature map are:
Average pooling returns the average of all the values from the portion of the image covered by the kernel. It performs dimensionality reduction, which also acts as a noise-suppressing mechanism.
Max pooling returns the maximum value from the portion of the image covered by the kernel. It also acts as a noise suppressant.
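Both pooling operations can be sketched with a reshape into non-overlapping blocks. The helper `pool2d` is hypothetical and assumes the input height and width are divisible by the pooling size (stride equal to the size).

```python
import numpy as np

def pool2d(x, size, reduce_fn):
    # Split x into (size x size) blocks and reduce each block to one value.
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size)
    return reduce_fn(blocks, axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 7, 6, 8],
                        [4, 2, 1, 3],
                        [8, 6, 5, 7]])

max_pooled = pool2d(feature_map, 2, np.max)
avg_pooled = pool2d(feature_map, 2, np.mean)
print(max_pooled)   # [[7 8] [8 7]]
print(avg_pooled)   # [[4. 5.] [5. 4.]]
```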
In convolutional networks, a fully connected layer converts the input image into a form suitable for an MLP: it flattens the image into a column vector. The flattened output is then fed to a feed-forward neural network, and backpropagation is applied at every training iteration.
Basically, it flattens the final output and feeds it to a regular neural network for classification purposes.
Recurrent neural networks process sequential data by maintaining a hidden state that serves as a form of memory.
Main challenges:
\(h_t=f(W_h h_{t-1}+W_x x_t+b)\)
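A recurrent update, in which the new hidden state depends on the previous hidden state and the current input, can be sketched as follows. The use of tanh for \(f\), the layer sizes and the random weights are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 3, 2

# Assumed random parameters; in practice these are learned.
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    # New hidden state from the previous state and the current input.
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

sequence = rng.normal(size=(4, input_size))   # 4 time steps
h = np.zeros(hidden_size)                     # initial hidden state
for x_t in sequence:
    h = rnn_step(h, x_t)

print(h)   # final hidden state, a summary of the whole sequence
```

The same weights are reused at every time step, which is what lets the network handle sequences of any length.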
Special structures that help RNNs store and retrieve information over longer sequences.
The new cell state combines the old state (filtered by the forget gate) and the candidate state (regulated by the input gate):
\(C_t=multiply(f_t,C_{t-1})+multiply(i_t,\tilde{C_t})\)
Where:
The hidden state is updated with the element-wise multiplication of the output gate and the tanh of the cell state:
\(h_t=multiply(o_t,\tanh(C_t))\)
Where:
The goal of mapping the cell state to the bounded range \((-1, 1)\) is to allow non-linear transformations, making it suitable for tasks like sequence modeling and memory representation.
In an LSTM, the output of each gate is an array of values in the range \((0, 1)\), which are multiplied element-wise with the relevant tensors to modulate the flow of information.
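The cell-state and hidden-state updates can be traced with small numbers. In this sketch the gate activations are given directly as assumed values; in a full LSTM they would come from sigmoid layers applied to the input and the previous hidden state.

```python
import numpy as np

# Assumed gate activations, each in the range [0, 1].
f_t = np.array([0.5, 1.0])        # forget gate
i_t = np.array([0.5, 0.0])        # input gate
o_t = np.array([1.0, 0.5])        # output gate

C_prev = np.array([2.0, -1.0])    # old cell state
C_tilde = np.array([1.0, 0.5])    # candidate state

C_t = np.multiply(f_t, C_prev) + np.multiply(i_t, C_tilde)
h_t = np.multiply(o_t, np.tanh(C_t))

print(C_t)   # [ 1.5 -1. ]
print(h_t)
```

The first component forgets half of the old state and admits half of the candidate; the second keeps the old state untouched because its input gate is closed.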

Applications:
Limitations:
Generative Adversarial Networks (GANs) are a class of machine learning models that are widely used for generating new data that closely resembles a given dataset. They were introduced by Ian Goodfellow and his collaborators in 2014.
GANs consist of two neural networks:
Generator (G):
Discriminator (D):
These two networks are trained simultaneously in a zero-sum game:
The competition between these two networks drives both to improve over time.
Objective function. The goal of the GAN is to optimize the following minimax game:
\(\min_G \max_D \mathbb{E}_{x \sim p_\text{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]\)
The generator minimizes the probability of the discriminator correctly identifying fake data, while the discriminator maximizes its classification accuracy.
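The value of the minimax objective can be estimated from a batch of discriminator outputs. The function name `gan_value` and the sample outputs are illustrative assumptions.

```python
import numpy as np

def gan_value(d_real, d_fake):
    # Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))].
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A confident discriminator: high scores on real data, low on fakes.
confident = gan_value(np.array([0.9, 0.8]), np.array([0.2, 0.1]))

# A fooled discriminator that outputs 0.5 everywhere.
fooled = gan_value(np.array([0.5, 0.5]), np.array([0.5, 0.5]))

print(confident)   # closer to 0 (the discriminator's maximum)
print(fooled)      # equals -log(4)
```

The discriminator pushes this value up toward 0, while the generator pushes it down; a discriminator reduced to guessing yields \(-\log 4\).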
Applications:
Mode collapse: the generator learns to output samples from only a few of the classes present in the training data (e.g., for digits 0-9, it always produces 1 or 7 because they look similar).
Once a GAN has successfully been trained, we can discard the discriminator and use the generator to create new samples. We do this simply by feeding it any input vector from the same noise distribution we drew from during training. The space from which we sample these vectors is called the latent space.
One fascinating aspect of GANs is that we can often perform highly interpretable arithmetic on vectors sampled from the latent space. For example, consider a GAN trained to generate images of faces. Experiments, such as those performed by Radford et al. (2015), show that, if \(G(z^{(1)})\) depicts a woman with glasses, \(G(z^{(2)})\) depicts a woman without glasses and \(G(z^{(3)})\) depicts a man without glasses, then \(G(z^{(1)} - z^{(2)} + z^{(3)})\) is an image of a man with glasses.
We can also use the latent space to interpolate between different generator outputs. For example, if we depict the output obtained by sampling along the line between two latent vectors, it results in a smooth deformation from one face to another.
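Latent-space interpolation and arithmetic only involve vector operations on the generator inputs. This sketch assumes an 8-dimensional Gaussian latent space and stops before the generator itself: each resulting vector would be passed to a trained generator, which is not defined here.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 8   # assumed latent dimensionality

# Interpolation: points along the line between two latent vectors.
z_a = rng.normal(size=latent_dim)
z_b = rng.normal(size=latent_dim)
alphas = np.linspace(0.0, 1.0, 5)
path = np.array([(1 - a) * z_a + a * z_b for a in alphas])

# Arithmetic: combine three latent vectors into a new one.
z_1, z_2, z_3 = rng.normal(size=(3, latent_dim))
z_new = z_1 - z_2 + z_3

print(path.shape)    # (5, 8)
print(z_new.shape)   # (8,)
```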
Conditional GANs (cGANs) are a type of Generative Adversarial Network (GAN) that introduce conditioning variables to guide the data generation process. Unlike traditional GANs, which generate data without any control over the output, cGANs allow for the generation of specific types of data based on additional input information.
In cGANs, both the generator and the discriminator are conditioned on auxiliary information \(y\), which could be:
The generator takes as input a random noise vector \(z\) and the condition \(y\), and generates data \(G(z, y)\).
The discriminator receives real or fake data together with the condition \(y\), and learns to determine if the data corresponds to the given condition.
The objective function for cGANs is modified to incorporate the conditioning variable \(y\):
\(\min_G \max_D \mathbb{E}_{x \sim p_\text{data}(x)}[\log D(x, y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z, y), y))]\)
Autoencoders are neural networks whose superficial task is to learn the identity function \(f(x)=x\). In other words, autoencoders learn to reconstruct their input. The real purpose of the autoencoder, however, is to learn an efficient representation of the input data that, ideally, preserves only the information required to obtain a sufficiently faithful reconstruction.
Consider, for example, a data set of images of a fixed size consisting of a single solid color. Supposing we know the size of the images, we only need the three numbers corresponding to the red, green, blue (RGB) value of the color to reconstruct the image. An autoencoder is capable of learning such a representation for the images. In much more complicated settings, autoencoders are high-level tools for dimensionality reduction, i.e., extraction of the most important features in the input data. Applications include data compression, denoising, and upscaling.
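The solid-color example can be made concrete with a hand-written encoder and decoder. Nothing is learned in this sketch: `encode` and `decode` are hypothetical stand-ins for what a trained autoencoder could discover, and the image size and color are assumed values.

```python
import numpy as np

height, width = 4, 4
color = np.array([0.2, 0.5, 0.9])   # assumed RGB value
image = np.broadcast_to(color, (height, width, 3)).copy()

def encode(x):
    # Compress the image to 3 numbers: its mean RGB value.
    return x.mean(axis=(0, 1))

def decode(z, height, width):
    # Rebuild the full image by repeating the RGB value at every pixel.
    return np.broadcast_to(z, (height, width, 3))

z = encode(image)
x_hat = decode(z, height, width)
mse = np.mean((image - x_hat) ** 2)   # reconstruction loss

print(z.shape)   # (3,)
print(mse)       # approximately 0
```

Here 48 numbers are compressed to 3 with essentially no reconstruction error, because the data set only ever varies along those three degrees of freedom.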
They consist of two main parts:
Mathematically, the encoder can be represented as:
\(z = f(x; \theta)\)
where \(x\) is the input data, \(z\) is the latent representation, and \(\theta\) are the parameters (weights and biases) of the encoder.
Mathematically, the decoder can be represented as:
\(\hat{x} = g(z; \phi)\)
where \(z\) is the latent representation, \(\hat{x}\) is the reconstructed data, and \(\phi\) are the parameters of the decoder.
 
Loss Function: The training objective of an autoencoder is to minimize the reconstruction loss, which measures the difference between the input \(x\) and the reconstructed output \(\hat{x}\). Common loss functions include:
Unsupervised Learning: Autoencoders don't require labeled data since the goal is to reconstruct the input itself.
Denoising Autoencoders (DAE):
Sparse Autoencoders:
Variational Autoencoders (VAE):
Contractive Autoencoders:
Convolutional Autoencoders:
Applications:
Challenges:
Direct Feedback Alignment is an alternative to backpropagation, aimed at reducing the computational dependency between layers during training. Instead of propagating the error gradients backward through each layer, DFA uses random projections to provide approximate feedback signals directly to each layer.
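A single DFA update for a network with one hidden layer might look like the sketch below. The layer sizes, tanh activation, squared-error loss and learning rate are assumptions; the key point is that the hidden layer receives the output error through a fixed random matrix `B1` rather than through the transpose of the forward weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 5, 2

W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))
B1 = rng.normal(scale=0.1, size=(n_hidden, n_out))   # fixed random feedback

x = rng.normal(size=n_in)
target = np.array([1.0, 0.0])
lr = 0.1

# Forward pass.
h = np.tanh(W1 @ x)
y = W2 @ h
error = y - target   # gradient of 0.5 * ||y - target||^2 w.r.t. y

# DFA: project the output error straight to the hidden layer with B1.
# Backpropagation would use W2.T @ error here instead.
delta_hidden = (B1 @ error) * (1.0 - h ** 2)

# Weight updates.
W2 -= lr * np.outer(error, h)
W1 -= lr * np.outer(delta_hidden, x)
```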
Challenges:
Synthetic Gradients decouple the training of network layers by generating gradients locally for each layer using a learned model rather than backpropagating gradients from the output loss.
Challenges:
Decoupled Neural Interfaces (DNIs) use synthetic gradients and/or synthetic inputs to break the computational dependencies ("locking") inherent in standard backpropagation. This allows simultaneous training of different layers or modules in a neural network, improving parallelization.
Types of Locking in Backpropagation:
Functionality of DNIs:
Implementation Example:
Training steps:
Advantages:
Challenges:
Applications and Practical Use:
| Aspect | DFA | Synthetic Gradients | Decoupled Interfaces | 
|---|---|---|---|
| Feedback Type | Random, fixed | Learned, adaptive | Interface-driven, local | 
| Coupling | Decoupled feedback per layer | Partially decoupled | Fully decoupled | 
| Biological Plausibility | Moderate | Low | Low | 
| Parallelism | Moderate | High | High | 
| Performance | Often suboptimal compared to BP | Can match BP if gradients are accurate | Dependent on interface design | 
These approaches focus on reducing the computational challenges of backpropagation and enabling more scalable and biologically inspired training paradigms.
In Neural Machine Translation (NMT)
Bidirectional Recurrent Neural Networks (RNNs) process sequential data (e.g., sentences) in both forward and backward directions to capture dependencies from both past and future contexts. Each input word (\(x_t\)) is processed by forward and backward layers, producing an intermediate concatenated vector (\(h_t\)).
Challenges:
Inspired by human translators, attention mechanisms enable a model to focus on the most relevant parts of an input sequence when generating each part of an output sequence. Proposed by Bahdanau et al. (2014), the idea is to dynamically assign "attention weights" to different parts of the input, which helps in translating long sentences.
Architecture:
Encoder-Decoder Model:
Attention Mechanism Components:
Training Attention:
Advantages:
Attention mechanisms have been successfully applied in diverse tasks like image captioning (e.g., Xu et al., 2015) and other sequence-to-sequence problems.
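The idea of assigning attention weights can be sketched with additive scoring in the style of Bahdanau et al. All dimensions and the parameter names `W_h`, `W_s` and `v` are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, enc_dim, dec_dim, attn_dim = 5, 6, 4, 3

encoder_states = rng.normal(size=(seq_len, enc_dim))   # one vector per input word
decoder_state = rng.normal(size=dec_dim)               # current decoder state

# Assumed (untrained) attention parameters.
W_h = rng.normal(size=(attn_dim, enc_dim))
W_s = rng.normal(size=(attn_dim, dec_dim))
v = rng.normal(size=attn_dim)

# One alignment score per input position.
scores = np.tanh(encoder_states @ W_h.T + W_s @ decoder_state) @ v

# Softmax turns scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: attention-weighted sum of the encoder states.
context = weights @ encoder_states

print(weights)
print(context.shape)   # (6,)
```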
Generative Learning Trilemma: A generative model needs to satisfy three requirements for widespread adoption:
Challenges with GANs:
Diffusion Models:
The diffusion model was introduced in the 2020 paper "Denoising Diffusion Probabilistic Models" by Jonathan Ho and colleagues. The original paper did not claim that diffusion models generated better images than GANs. Many papers have since improved upon the original diffusion model, with some claiming that diffusion models now outperform GANs in image synthesis. The authors of the original paper published another paper showing that their improved diffusion model captured the breadth of training data variance better and beat state-of-the-art GANs in image generation tasks.
FID: the quality of images generated by a model is evaluated using the FID (Fréchet Inception Distance) score. A lower FID score indicates better image quality, with a perfect score being 0.0.
Comparison with GANs: The improved diffusion model, known as the Ablated Diffusion Model (ADM), achieved the best FID scores across various datasets, outperforming several GANs.
Diffusion Process:
Forward Diffusion Process: Adding noise to an image in incremental steps until it becomes unrecognizable.
Reverse Diffusion Process:
Training Objective: Train a neural network to approximate the reverse diffusion process, enabling the generation of images from noise.
The forward diffusion process is modeled as a Markov chain.
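The forward Markov chain can be simulated step by step. This sketch assumes the Gaussian transition used in the DDPM paper, \(x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon\), with a linear schedule from \(10^{-4}\) to \(0.02\) over 1000 steps; the 8x8 array is a stand-in for an image.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # stand-in for an image
x = x0.copy()

# Each step depends only on the previous state (Markov property).
for beta_t in betas:
    noise = rng.normal(size=x.shape)
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * noise

# Fraction of the original signal variance left after T steps.
alpha_bar_T = np.prod(1.0 - betas)
print(alpha_bar_T)   # very small: x is now almost pure Gaussian noise
```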
Formulas:
Neural Network:
Formulas:
\(p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1}; \color{red}{\mu_\theta(x_t, t)},\, \color{red}{\Sigma_\theta(x_t, t)}\right)\)
The neural network learns to undo the noise added at each time step. Noise was added incrementally and iteratively in the forward process, allowing the reverse process to function effectively.
The diffusion model can be thought of as a latent variable generator model, similar to a Variational Autoencoder (VAE).
Simple Autoencoders:
Variational Autoencoders (VAEs):
Relation to Diffusion Models: