In 2025, I undertook specialized courses on TensorFlow alongside my Master of Science in Applied Artificial Intelligence to strengthen practical and applied skills in parallel with the program’s theoretical and scientific focus.

Introduction

Like Kaggle, Google Colab (https://colab.research.google.com/) lets you run TensorFlow code in the cloud.

Reminder: NumPy performance and why it is used for deep learning

NumPy achieves high performance by offloading computationally intensive operations to precompiled C and Fortran libraries, allowing it to bypass Python’s interpreter overhead. It uses contiguous, typed memory blocks similar to C arrays, which improve cache efficiency and enable fast, low-level access.

Through vectorization and broadcasting, NumPy performs operations on entire arrays without explicit Python loops, significantly speeding up execution.

Additionally, it integrates with highly optimized libraries like BLAS and LAPACK for linear algebra tasks, further boosting performance while keeping code in high-level Python.
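
To make the vectorization and broadcasting points concrete, here is a minimal sketch. The array sizes and the helper name dot_loop are illustrative assumptions, not part of the course material.

```python
import numpy as np


def dot_loop(a, b):
    """Dot product with an explicit Python loop (the slow, interpreted path)."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


rng = np.random.default_rng(0)
a = rng.random(10_000)
b = rng.random(10_000)

loop_result = dot_loop(a, b)
vectorized_result = np.dot(a, b)  # same computation, executed in compiled code

# Broadcasting: subtract the per-column mean from every row without a loop.
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)
centered = matrix - matrix.mean(axis=0)  # shapes (3, 4) - (4,) -> (3, 4)
```

Both versions compute the same value; the vectorized call simply moves the loop out of the Python interpreter.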

TensorFlow Notes

TensorFlow is a deep learning framework specifically designed for building and deploying neural network models. It is widely used in production environments due to its scalability and robustness.

Comparison with Other Frameworks

  • PyTorch is often preferred for smaller-scale projects or rapid prototyping due to its flexibility and more Pythonic interface. However, it has historically been less common than TensorFlow in large-scale production deployments.
  • Other specialized platforms for specific use cases include:

    • DeepLearningKit
    • Caffe

Tensor Basics

For an introduction to tensors see My notes about Deep Learning.


TensorFlow Versions

TensorFlow 2.x (Eager Execution)

  • Eager execution is enabled by default, meaning operations are evaluated immediately, making it easier to debug and develop.
  • No need for explicitly defined sessions or placeholders.
  • Encourages a more Pythonic and functional programming style.
  • Suitable for most applications, especially those needing interactive development and debugging.

TensorFlow 1.x (Graph Execution)

  • Uses lazy execution (build-and-run model), where you define the computation graph first and execute it later within a session.
  • Requires defining placeholders, which are variables without assigned values at creation. These are fed during session runs.
  • More efficient for large-scale models and datasets due to static graph optimization.

Creating Graphs in TensorFlow 2

  • You can create graphs using @tf.function decorators, which convert Python functions into TensorFlow computation graphs.
  • This approach blends the benefits of eager execution and static graphs—easy development with performance optimization.

Beyond Deep Learning

While TensorFlow is best known for deep learning, it also supports:

  • Support Vector Machines (SVM)
  • Decision Trees
  • Linear and Logistic Regression

Training Models (Reference: 03_02_Train.ipynb)

  • Learn how to:

    • Build Sequential models.
    • Assign loss functions with model.compile(loss='loss_function').
    • Train models using model.fit() by specifying parameters like epochs, batch_size, etc.
    • Monitor training using returned metrics.

Useful Examples:

  • Plot training loss:

    plt.plot(learning.history['loss'])
  • Access weights of a specific layer (e.g., layer at index 1):

    rn_model.layers[1].weights

    This returns the kernel (weights) and bias values. For a dense layer with one unit, dense/kernel represents the slope, and dense/bias the offset.

Dense Layers in Regression

  • For regression problems, a Dense layer with 1 unit is often sufficient as the output layer.
  • It is important to specify the input shape. For example, an image of size 8×8 should have an input dimension of 64 (input_dim=64).
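
As a rough illustration of what a one-unit Dense output layer computes on flattened 8×8 inputs, here is a NumPy sketch. The random images and the weight initialization are assumptions for illustration; Keras handles the weights internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five hypothetical 8x8 images, flattened to 64 features each.
images = rng.random((5, 8, 8))
flat = images.reshape(len(images), -1)  # shape (5, 64)

# A Dense layer with 1 unit holds a (64, 1) kernel and a bias of shape (1,).
kernel = rng.normal(size=(64, 1))
bias = np.zeros(1)

# Linear activation (the default): one regression output per image.
predictions = flat @ kernel + bias  # shape (5, 1)
```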

Saving and Restoring Models

What You Can Save

TensorFlow allows you to save the following components of a model:

  • Model architecture
  • Trained weights
  • Optimizer state
  • Loss function and evaluation metrics

Example

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10),
    tf.keras.layers.Dense(2)
])
model.save('TEST.keras')

This saves the whole model to a single file, 'TEST.keras' (the native Keras format, a zip archive), rather than to a directory.

Restoring the Model

restored_model = tf.keras.models.load_model('TEST.keras')

Note: If you need to use the model with different data types (e.g., float32 vs int), you must either re-train the model on the new data or convert future data to match the model's original data type.


Building a Neural Network

Step 1: Data Preparation

  • Normalize, reshape, or extract features (e.g., target variables) as necessary before feeding data into the model.

Step 2: Choosing a Model Type

  • The Sequential API is user-friendly and ideal when there's a single input and a single output.
  • You can stack various layers such as:

    • Dense: Fully-connected layer
    • Conv2D: Convolutional layer
    • (Add more as needed, e.g., LSTM, Dropout, Flatten)

Step 3: Adding Layers

  • The units parameter defines the number of neurons.
  • input_dim (or input_shape) defines the input size.

Example: Decreasing Number of Neurons

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=8, input_dim=1))
model.add(tf.keras.layers.Dense(units=4))
model.add(tf.keras.layers.Dense(units=1))

Step 4: Compiling the Model

Choose:

  • An optimizer (e.g., 'adam', 'sgd', tf.keras.optimizers.Lion())
  • A loss function (e.g., 'mse' for regression)
model.compile(optimizer='adam', loss='mse')

Step 5: Fitting the Model

Train the model using:

model.fit(x, y, epochs=10)
  • x: input data
  • y: target data
  • epochs: number of training iterations over the dataset

Enhancing Performance

1. Activation Functions

  • Default is linear (identity).
  • Common choice: 'relu' for hidden layers.
model.add(layers.Dense(units=16, activation='relu'))

See deeplearning.md for more on activation functions.

2. Regularization

Add L2 regularization to reduce overfitting:

model.add(layers.Dense(units=8, kernel_regularizer=tf.keras.regularizers.L2(0.1)))
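
The penalty that an L2 kernel regularizer adds to the loss can be sketched in a few lines of NumPy. The kernel values and the data loss below are made-up numbers for illustration.

```python
import numpy as np


def l2_penalty(kernel, l2=0.1):
    """L2 penalty: the regularization factor times the sum of squared weights."""
    return l2 * np.sum(np.square(kernel))


kernel = np.array([[1.0, -2.0], [0.5, 0.0]])  # hypothetical layer weights
data_loss = 0.25                              # hypothetical loss on the batch

penalty = l2_penalty(kernel)       # 0.1 * (1 + 4 + 0.25 + 0) = 0.525
total_loss = data_loss + penalty   # the optimizer minimizes this sum
```

Because the penalty grows with the squared weights, minimizing the total loss pushes the weights toward smaller values.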

3. Optimizers

  • Default is RMSprop.
  • You can specify other optimizers like:
model.compile(optimizer=tf.keras.optimizers.Lion(), loss='mse')

Batch Normalization

  • Used to stabilize training and reduce overfitting.
  • Normalizes the input to a layer so that it has a mean of 0 and a standard deviation of 1.
  • Place the BatchNormalization layer immediately before the target layer.
model.add(layers.Dense(units=16))
model.add(tf.keras.layers.BatchNormalization())
model.add(layers.Dense(units=8))
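
The normalization applied in training mode can be sketched as follows. This is a simplified NumPy version: gamma, beta, and eps are fixed here as assumptions, whereas the Keras layer learns gamma and beta and uses moving averages at inference time.

```python
import numpy as np


def batch_norm_train(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(256, 4))  # hypothetical activations
normalized = batch_norm_train(batch)
# Each column of `normalized` now has mean close to 0 and standard deviation close to 1.
```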

Dropout (as Regularization)

  • Randomly disables a fraction of neurons during each training step to prevent overfitting.
  • Place the Dropout layer after the layer it regularizes.
model.add(layers.Dense(units=16))
model.add(layers.Dense(units=8))
model.add(layers.Dropout(rate=0.5))

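In this example, each output of the 8-unit layer has a 50% chance of being set to zero at every training step (a new random mask is drawn for each batch, not once per epoch). Dropout is inactive at inference time.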
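
A minimal NumPy sketch of dropout in training mode, assuming the usual "inverted dropout" scaling in which kept activations are divided by (1 - rate). The helper name dropout_train is hypothetical.

```python
import numpy as np


def dropout_train(x, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((4, 8))  # hypothetical outputs of the 8-unit layer

# A fresh mask is drawn on every call, i.e. on every training step.
step_1 = dropout_train(activations, rate=0.5, rng=rng)
step_2 = dropout_train(activations, rate=0.5, rng=rng)
```

With rate=0.5, every entry is either 0 (dropped) or 2 (kept and rescaled), so the expected activation stays the same.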

Stochastic Gradient Descent

With a batch size of 150, each training iteration (or step) processes a mini-batch of 150 data points from the dataset. When shuffling is enabled, the data is reshuffled at every epoch, so the composition of the batches changes from one epoch to the next. Within a single epoch each sample is used once, so batches do not overlap unless samples are drawn with replacement.

In each iteration, the model’s weights are updated based on the average gradient of the loss function computed over the current batch, scaled by the learning rate.

The total number of iterations per epoch is determined by the dataset size and batch size: it equals \(\lceil N / \text{batch\_size} \rceil\), where \(N\) is the total number of training examples.

For example, a dataset with 1,000 samples and a batch size of 150 would require 7 iterations per epoch (with the last batch potentially being smaller unless explicitly dropped).

The overall number of iterations in training is then the product of the number of epochs and the iterations per epoch.

Note that one epoch is complete when the model has seen every training sample once (though not necessarily in a single batch).
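
The arithmetic above can be checked with a short sketch. The helper name steps_per_epoch and the choice of 10 epochs are assumptions for illustration.

```python
import math


def steps_per_epoch(n_samples, batch_size, drop_remainder=False):
    """Number of weight updates needed to see every sample once."""
    if drop_remainder:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


n_samples, batch_size, epochs = 1000, 150, 10

per_epoch = steps_per_epoch(n_samples, batch_size)           # 7
last_batch_size = n_samples - (per_epoch - 1) * batch_size   # 100
total_steps = epochs * per_epoch                             # 70

# MNIST with validation_split=0.1 leaves 54,000 training samples;
# with batch_size=128 this gives the 422 steps per epoch seen in the
# training logs later in these notes.
mnist_steps = steps_per_epoch(54_000, 128)                   # 422
```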

The role of the optimizer

In the context of Stochastic Gradient Descent (SGD) and its variants (e.g., mini-batch SGD), the optimizer (such as Adam, RMSprop, or SGD with momentum) determines how the model's weights are updated using the gradients computed during backpropagation. Plain (vanilla) SGD applies the same fixed learning rate to all weights, which can lead to slow convergence or instability; the other optimizers add mechanisms that address this.

Basic SGD (Stochastic Gradient Descent)

  • Update rule:

\(w_{t+1} = w_t - \eta \cdot \nabla_w \mathcal{L}(w_t)\)

where \(\eta\) is the learning rate and \(\nabla_w \mathcal{L}\) is the gradient of the loss.

  • Limitations:
    • Fixed learning rate for all weights (no adaptation).
    • Sensitive to noisy or sparse gradients.
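
The update rule can be traced on a toy problem. The sketch below assumes a hypothetical one-dimensional loss L(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate and step count are arbitrary choices.

```python
def sgd_step(w, grad, lr):
    """One SGD update: move against the gradient, scaled by the learning rate."""
    return w - lr * grad


def loss_grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w = 0.0
lr = 0.1
first_step = sgd_step(w, loss_grad(w), lr)  # 0 - 0.1 * (-6) = 0.6

for _ in range(50):
    w = sgd_step(w, loss_grad(w), lr)
# The distance to the minimum shrinks by a factor of 0.8 per step, so w ends close to 3.
```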

Advanced Optimizers (e.g., Adam, RMSprop, Momentum)

These introduce adaptive mechanisms to address SGD’s limitations:

  1. Momentum-Based (e.g., SGD with Momentum, Nesterov)
    • Idea: Accumulate a moving average of past gradients to dampen oscillations.
    • Effect: Faster convergence through valleys/sharp minima by adding "inertia."
  2. Adaptive Learning Rates (e.g., Adam, RMSprop, Adagrad)

    • Idea: Scale the learning rate per parameter based on historical gradient magnitudes.
    • Adam (most popular): Combines momentum (1st moment) and adaptive learning rates (2nd moment) with bias correction.

    \(w_{t+1} = w_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\),

    where \(\hat{m}_t\) is momentum and \(\hat{v}_t\) scales the learning rate adaptively.

  3. Second-Order Methods (e.g., L-BFGS)
    • Use curvature (Hessian) information for updates (rare in deep learning due to computational cost).
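
For comparison with plain SGD, here is a sketch of a single Adam update for one scalar weight, following the formula above. The default hyperparameters shown are commonly used values and are assumptions here, as is the toy loss L(w) = (w - 3)^2.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; returns the new weight and the updated moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad        # 1st moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # 2nd moment (squared gradients)
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# On the first step the update size is about lr, whatever the gradient magnitude.
w1, m1, v1 = adam_step(w=0.0, grad=-6.0, m=0.0, v=0.0, t=1)

w, m, v = 0.0, 0.0, 0.0
for t in range(1, 11):
    grad = 2.0 * (w - 3.0)  # gradient of the toy loss
    w, m, v = adam_step(w, grad, m, v, t)
```

Dividing by the square root of the second moment is what makes the step size roughly independent of the gradient scale for each parameter.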

The optimizer decides how to use gradients to update weights, introducing mechanisms like momentum or parameter-specific learning rates to make SGD more efficient and stable. Adam (or a variant such as AdamW) is often the default choice due to its adaptivity and is the usual choice for training LLMs, but well-tuned SGD, usually with momentum, can still match or outperform it in certain scenarios (e.g., training convolutional networks for image classification).

Lottery Ticket Hypothesis with TensorFlow

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential, clone_model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt
import time

# 1. Load and preprocess data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# 2. Define the original model
def create_model():
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(400, activation='relu'),
        Dense(300, activation='relu'),
        Dense(10, activation='softmax')
    ])
    return model
# 3. Initialize and train the original model
original_model = create_model()
initial_weights = original_model.get_weights()  # Save the initial random weights
original_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

original_model_learning = original_model.fit(x_train, y_train, epochs=4, batch_size=128, validation_split=0.1)
Epoch 1/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 7s 14ms/step - accuracy: 0.8731 - loss: 0.4482 - val_accuracy: 0.9650 - val_loss: 0.1220
Epoch 2/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 9s 11ms/step - accuracy: 0.9712 - loss: 0.0939 - val_accuracy: 0.9727 - val_loss: 0.0854
Epoch 3/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 6s 14ms/step - accuracy: 0.9828 - loss: 0.0571 - val_accuracy: 0.9785 - val_loss: 0.0738
Epoch 4/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 5s 11ms/step - accuracy: 0.9881 - loss: 0.0382 - val_accuracy: 0.9770 - val_loss: 0.0837
# 4. Prune the model (simple magnitude-based pruning)
def prune_weights(model, pruning_percent=0.5):
    weights = model.get_weights()
    new_weights = []
    for w in weights:
        if len(w.shape) > 1:  # only prune dense layer weights
            threshold = np.percentile(np.abs(w), pruning_percent * 100)
            mask = np.abs(w) > threshold
            w = w * mask  # zero out the small weights
        new_weights.append(w)
    return new_weights

pruned_weights = prune_weights(original_model, pruning_percent=0.5)
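# 5. Reset to the initial random weights and zero out their smallest 40% by magnitude.
#    Note: this mask is computed from the initial weights themselves, not from the
#    trained model, and pruned_weights from step 4 is not used.
winning_ticket_model = create_model()
winning_ticket_model.set_weights(initial_weights)
masked_weights = prune_weights(winning_ticket_model, pruning_percent=0.4)
winning_ticket_model.set_weights(masked_weights)
# 6. Retrain the pruned model (winning ticket).
#    The mask is not re-applied during training, so zeroed weights can become non-zero again.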
winning_ticket_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
winning_ticket_model_learning = winning_ticket_model.fit(x_train, y_train, epochs=4, batch_size=128, validation_split=0.1)
Epoch 1/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 7s 14ms/step - accuracy: 0.8732 - loss: 0.4575 - val_accuracy: 0.9715 - val_loss: 0.0985
Epoch 2/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 9s 11ms/step - accuracy: 0.9701 - loss: 0.1006 - val_accuracy: 0.9773 - val_loss: 0.0757
Epoch 3/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 6s 14ms/step - accuracy: 0.9824 - loss: 0.0573 - val_accuracy: 0.9795 - val_loss: 0.0659
Epoch 4/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 5s 11ms/step - accuracy: 0.9882 - loss: 0.0391 - val_accuracy: 0.9810 - val_loss: 0.0705
# 7. Evaluate both models
print("Original model performance:")
original_model.evaluate(x_test, y_test)

print("Winning ticket model performance:")
winning_ticket_model.evaluate(x_test, y_test)
Original model performance:
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9677 - loss: 0.1016
Winning ticket model performance:
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9738 - loss: 0.0804

[0.07047921419143677, 0.977400004863739]
# Get the weights of the first hidden layer (Dense(400, activation='relu'))
print(original_model.layers[1].get_weights()[0].shape)
original_model.layers[1].get_weights()[0]  # Shape: (input_dim, output_dim)
(784, 400)

array([[-0.05667121,  0.02065124, -0.04778236, ...,  0.05302081,
        -0.06146147,  0.00401637],
       [ 0.00472874,  0.00062283, -0.06726848, ..., -0.01869787,
         0.00495344,  0.01626997],
       [-0.06472269,  0.06735108,  0.01358175, ...,  0.00604779,
        -0.04576658,  0.02837193],
       ...,
       [ 0.05708074,  0.05192086,  0.04327519, ..., -0.05372921,
         0.06987718,  0.01477351],
       [ 0.05564093, -0.06744465, -0.01727122, ..., -0.03752626,
         0.01364662,  0.01879197],
       [ 0.04817592,  0.00282612, -0.04398317, ...,  0.04413202,
        -0.01568541, -0.02512715]], dtype=float32)
# Get the weights of the layer
weights = winning_ticket_model.layers[1].get_weights()[0]  # Shape: (input_dim, output_dim)

# Create a boolean mask for weights == 0
zero_mask = (np.abs(weights) == 0)

# Optional: Count the percentage of zeros
sparsity = np.mean(zero_mask) * 100
print(f"Sparsity: {sparsity:.2f}% of weights are exactly zero")

# Visualize the mask (e.g., as a heatmap)
import matplotlib.pyplot as plt
plt.imshow(zero_mask, cmap='gray', interpolation='none')
plt.title("Mask of Zero Weights (White = Zero)")
plt.colorbar()
plt.show()
Sparsity: 4.46% of weights are exactly zero

[Figure: heatmap of zero_mask, titled "Mask of Zero Weights (White = Zero)"]
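
In the lottery ticket procedure as usually described, the pruning mask is derived from the trained weights, applied to the initial weights, and kept fixed while retraining. The listing above instead prunes the initial weights by their own magnitude and does not re-apply the mask during training, which is consistent with the low sparsity it reports. The NumPy sketch below shows the mask handling only. The random arrays stand in for real initial and trained weights, and the "update" is a placeholder for an optimizer step; all names are hypothetical.

```python
import numpy as np


def magnitude_mask(trained_w, pruning_percent):
    """Boolean mask that keeps the largest-magnitude trained weights."""
    threshold = np.percentile(np.abs(trained_w), pruning_percent * 100)
    return np.abs(trained_w) > threshold


rng = np.random.default_rng(0)
initial_w = rng.normal(size=(20, 10))                                # stand-in for initial weights
trained_w = initial_w + rng.normal(scale=0.5, size=initial_w.shape)  # stand-in for trained weights

mask = magnitude_mask(trained_w, pruning_percent=0.5)  # mask from the TRAINED weights
ticket_w = initial_w * mask                            # applied to the INITIAL weights

# Placeholder for one optimizer step, followed by re-applying the mask
# so that pruned connections stay at zero.
updated_w = ticket_w - 0.01 * rng.normal(size=ticket_w.shape)
updated_w = updated_w * mask

sparsity = np.mean(updated_w == 0) * 100  # stays at the pruning level
```

In Keras, re-applying the mask after every batch would need something like a custom callback or training loop; that part is not shown here.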

Generative Models Overview

Discriminative vs. Generative Models

  • Discriminative models learn to distinguish between different classes or categories. They model the decision boundary between classes and are typically used for classification tasks.
  • Generative models, on the other hand, learn the underlying distribution of the data to generate new, similar data points. They describe how data is generated in a probabilistic manner.

Key Characteristics of Generative Models

  • Accept random noise as input and produce new, unique samples.
  • Typically unsupervised (do not require labeled data).
  • Probabilistic rather than deterministic—randomness enables the generation of diverse outputs.
  • Applications include:
    • Data augmentation for imbalanced datasets.
    • Imputation of missing values.
    • Anonymization of sensitive data (producing realistic but non-identifiable samples).

Two Major Types Covered

  • Generative Adversarial Networks (GANs) – Introduced by Ian Goodfellow et al. in 2014.
  • Denoising Diffusion Probabilistic Models (DDPMs) – Introduced by Jonathan Ho et al. in 2020.

Generative Adversarial Networks (GANs)

Core Concept

GANs consist of two neural networks in competition:

  • Generator (G): Learns to produce realistic data (e.g., images) from random noise.
  • Discriminator (D): Learns to distinguish between real data and data generated by the generator.

This setup forms a zero-sum game where:

  • The generator tries to fool the discriminator.
  • The discriminator tries to correctly identify real vs. fake data.

Training Dynamics

  • Initially, the discriminator easily detects fake data.
  • Over time, the generator improves, producing more realistic samples.
  • Ideally, training reaches a point where the discriminator cannot distinguish real from fake (50% accuracy).

Applications

  • Image generation (e.g., human faces, cartoons).
  • Text-to-image translation (e.g., generating images from descriptions).
  • Image inpainting (e.g., filling in missing or blurred parts).
  • 3D object generation from 2D images.

Denoising Diffusion Probabilistic Models (DDPMs)

Core Concept

  • DDPMs generate high-quality images through a two-step process:
    1. Forward diffusion: Gradually adds noise to an image over several steps.
    2. Reverse diffusion: A neural network learns to reverse this process, removing noise step-by-step to reconstruct the image.
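
The forward (noising) half of this process has a simple closed form, sketched below in NumPy. The linear noise schedule, the number of steps T, and the helper name q_sample are assumptions for illustration; the reverse process requires a trained network and is not shown.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)  # fraction of the signal variance kept at step t


def q_sample(x0, t, rng):
    """Jump directly from a clean image x0 to its noisy version at step t."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise


rng = np.random.default_rng(0)
x0 = rng.random((28, 28))                # hypothetical image with values in [0, 1)
x_early, _ = q_sample(x0, t=10, rng=rng)     # still close to the original
x_late, _ = q_sample(x0, t=T - 1, rng=rng)   # almost pure Gaussian noise
```

During training, the network would be given x_t and t and asked to predict the noise that was added, which is what lets it undo the corruption step by step.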

Key Features

  • Capable of producing highly detailed and realistic images.
  • Do not require labeled data—also an unsupervised learning method.
  • Often more stable to train than GANs.

GAN Architecture

The architecture of a GAN consists of two main components:

  • Generator (G): Takes random noise as input and generates synthetic (fake) images.
  • Discriminator (D): Receives both real images (from the training dataset) and fake images (from the generator), and learns to classify them as real or fake.

The training process is adversarial:

  • The generator tries to produce images that are indistinguishable from real ones.
  • The discriminator tries to correctly identify which images are real and which are generated.

Training Process

Training Dynamics

  • Initially, the discriminator easily distinguishes real from fake.
  • As training progresses, the generator improves, making it harder for the discriminator to tell the difference.
  • Ideally, the discriminator's accuracy approaches 50%, meaning it can no longer reliably distinguish real from fake—indicating a well-trained generator.

Loss Functions

GANs use two separate loss functions:

  • Discriminator Loss: Encourages the discriminator to:
    • Maximize the probability of correctly classifying real images as real.
    • Minimize the probability of misclassifying fake images as real.
  • Generator Loss: Encourages the generator to:
    • Maximize the probability that fake images are classified as real by the discriminator.

Both networks are trained using backpropagation.
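
The two objectives can be written with binary cross-entropy, as in the sketch below. It assumes the discriminator outputs a probability that its input is real; the function names and example values are hypothetical.

```python
import math


def bce(p, label):
    """Binary cross-entropy for one predicted probability p and a 0/1 label."""
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake):
    """Real images should score 1, generated images should score 0."""
    return mean(bce(p, 1) for p in d_real) + mean(bce(p, 0) for p in d_fake)


def generator_loss(d_fake):
    """The generator wants its images to be scored as real (label 1)."""
    return mean(bce(p, 1) for p in d_fake)


# Early in training: the discriminator is confident and correct.
early_d = discriminator_loss([0.9, 0.95], [0.1, 0.05])  # small
early_g = generator_loss([0.1, 0.05])                   # large

# Ideal end state: the discriminator outputs 0.5 for everything.
final_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])    # 2 * ln(2)
final_g = generator_loss([0.5, 0.5])                    # ln(2)
```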

Common Loss Functions

  • Binary Cross-Entropy (BCE): Measures the difference between predicted probabilities and actual labels. It heavily penalizes misclassifications.
  • Minimax Loss (original GAN formulation): Can lead to vanishing gradients early in training.
  • Wasserstein Loss (WGAN): Improves training stability by replacing the discriminator with a critic that outputs a real-valued score instead of a probability.

Common Challenges in GAN Training

1. Vanishing Gradients

  • Occurs when the discriminator becomes too strong early in training.
  • The generator receives little to no gradient information, making it hard to improve.
  • Solutions:
    • Use Wasserstein loss instead of minimax.
    • Apply gradient penalty or label smoothing.

2. Mode Collapse

  • The generator produces limited types of outputs (e.g., the same image repeatedly) to fool the discriminator.
  • Fails to capture the diversity of the real dataset.
  • Solutions:
    • Use Wasserstein loss.
    • Implement Unrolled GANs, where the generator anticipates future discriminator updates.
    • Apply mini-batch discrimination or feature matching.

3. Failure to Converge

  • The generator and discriminator fail to reach equilibrium.
  • The discriminator may become too weak or too strong, leading to unstable training.
  • Symptoms: Generator quality degrades, discriminator gives random feedback.
  • Solutions:
    • Add noise to discriminator inputs.
    • Apply weight regularization to prevent overfitting.
    • Use learning rate scheduling or two-time scale update rules (TTUR).

⚠️ Note: These problems are still active areas of research. No universal solution exists, and training GANs often requires careful tuning and experimentation.
