In 2025, I undertook specialized courses on TensorFlow alongside my Master of Science in Applied Artificial Intelligence to strengthen practical and applied skills in parallel with the program’s theoretical and scientific focus.
Similarly to Kaggle, https://colab.research.google.com/ allows you to run TensorFlow code in the cloud.
NumPy achieves high performance by offloading computationally intensive operations to precompiled C and Fortran libraries, allowing it to bypass Python’s interpreter overhead. It uses contiguous, typed memory blocks similar to C arrays, which improve cache efficiency and enable fast, low-level access.
Through vectorization and broadcasting, NumPy performs operations on entire arrays without explicit Python loops, significantly speeding up execution.
Additionally, it integrates with highly optimized libraries like BLAS and LAPACK for linear algebra tasks, further boosting performance while keeping code in high-level Python.
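A small sketch of what vectorization and broadcasting look like in practice (the array sizes and values are arbitrary):

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

# Vectorized: the loop runs in precompiled C code, not in the Python interpreter
b = a * 2.0 + 1.0

# Broadcasting: a (3, 1) column combines with a (4,) row to give a (3, 4) result
# without explicit loops or data copies
col = np.array([[0.0], [10.0], [20.0]])
row = np.array([1.0, 2.0, 3.0, 4.0])
grid = col + row  # shape (3, 4)
```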
TensorFlow is a deep learning framework specifically designed for building and deploying neural network models. It is widely used in production environments due to its scalability and robustness.
Other specialized platforms exist for specific use cases.
For an introduction to tensors see My notes about Deep Learning.
The `@tf.function` decorator converts Python functions into TensorFlow computation graphs. While TensorFlow is best known for deep learning, it also supports more general machine learning and numerical workloads.
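A minimal sketch of the decorator (the function and values here are my own, purely for illustration):

```python
import tensorflow as tf

@tf.function
def mse(y_true, y_pred):
    # Traced into a TensorFlow graph on the first call
    return tf.reduce_mean(tf.square(y_true - y_pred))

x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([1.5, 2.0, 2.5])
print(mse(x, y))  # mean of (0.25, 0.0, 0.25) -> ~0.1667
```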
(`03_02_Train.ipynb`) Learn how to:
- Compile a model: `model.compile(loss='loss_function')`.
- Train it with `model.fit()`, specifying parameters like `epochs`, `batch_size`, etc.
- Plot the training loss: `plt.plot(learning.history['loss'])`.
Access the weights of a specific layer (e.g., the layer at index 1): `rn_model.layers[1].weights`. This returns the kernel (weights) and bias values. For a dense layer with one unit, `dense/kernel` represents the slope and `dense/bias` the offset.
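A minimal end-to-end sketch tying compile, fit, the loss plot, and weight inspection together (the toy data, layer sizes, and learning rate are my own; since the data follows y = 2x + 1, the kernel should approach 2 and the bias 1):

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Toy linear data: y = 2x + 1
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1).astype(np.float32)
y = (2.0 * x + 1.0).astype(np.float32)

# A single dense unit: its kernel plays the role of the slope, its bias the offset
rn_model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(units=1)
])
rn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.05), loss='mse')

learning = rn_model.fit(x, y, epochs=100, batch_size=32, verbose=0)

plt.plot(learning.history['loss'])
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()

# Kernel should be close to 2.0 (slope), bias close to 1.0 (offset)
print(rn_model.layers[-1].weights)
```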
TensorFlow allows you to save a model's architecture, weights, and training configuration (including the optimizer state):
```python
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),  # define the input shape so the model is built before saving; (4,) is just an example
    tf.keras.layers.Dense(10),
    tf.keras.layers.Dense(2)
])
model.save('TEST.keras')
```

This saves the whole model to `TEST.keras`, the native Keras format (a single archive file). Reload it with:

```python
restored_model = tf.keras.models.load_model('TEST.keras')
```
Note: If you need to use the model with different data types (e.g., float32 vs int), you must either re-train the model on the new data or convert future data to match the model's original data type.
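For instance (a sketch reusing `restored_model` from above; the integer array is made up), cast incoming data to the model's original dtype:

```python
import numpy as np

# Hypothetical new data arriving as 64-bit integers
x_new = np.array([[1, 2, 3, 4]], dtype=np.int64)

# Cast it to float32 to match the dtype the model was trained on
x_new = x_new.astype(np.float32)
predictions = restored_model.predict(x_new)
```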
You can stack various layers such as:
- `Dense`: fully-connected layer
- `Conv2D`: convolutional layer

The `units` parameter defines the number of neurons; `input_dim` (or `input_shape`) defines the input size.

```python
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=8, input_dim=1))
model.add(tf.keras.layers.Dense(units=4))
model.add(tf.keras.layers.Dense(units=1))
```
Choose:
- an optimizer (e.g., `'adam'`, `'sgd'`, `tf.keras.optimizers.Lion()`)
- a loss function (e.g., `'mse'` for regression)

```python
model.compile(optimizer='adam', loss='mse')
```
Train the model using:

```python
model.fit(x, y, epochs=10)
```

- `x`: input data
- `y`: target data
- `epochs`: number of training iterations over the dataset

Specify an activation function, e.g. `'relu'` for hidden layers:

```python
model.add(layers.Dense(units=16, activation='relu'))
```

See `deeplearning.md` for more on activation functions.
Add L2 regularization to reduce overfitting:

```python
model.add(layers.Dense(units=8, kernel_regularizer=tf.keras.regularizers.L2(0.1)))
model.compile(optimizer=tf.keras.optimizers.Lion(), loss='mse')
```
Add a `BatchNormalization` layer immediately before the target layer:

```python
model.add(layers.Dense(units=16))
model.add(tf.keras.layers.BatchNormalization())
model.add(layers.Dense(units=8))
```
Add a `Dropout` layer after the layer it regularizes:

```python
model.add(layers.Dense(units=16))
model.add(layers.Dense(units=8))
model.add(layers.Dropout(rate=0.5))
```

In this example, 50% of the neurons in the 8-unit layer are randomly deactivated at each training step.
When training a neural network with backpropagation and a batch size of 150, each iteration (or step) processes a batch of 150 data points from the dataset. In the standard setup the data is shuffled at the start of each epoch and partitioned into non-overlapping batches; batches only overlap if the sampling strategy draws with replacement.
In each iteration, the model’s weights are updated based on the average gradient of the loss function computed over the current batch, scaled by the learning rate.
The total number of iterations per epoch is determined by the dataset size and batch size: it equals \(\lceil N / \text{batch\_size} \rceil\), where \(N\) is the total number of training examples.
For example, a dataset with 1,000 samples and a batch size of 150 would require 7 iterations per epoch (with the last batch potentially being smaller unless explicitly dropped).
The overall number of iterations in training is then the product of the number of epochs and the iterations per epoch.
Note that one epoch is complete when the model has seen every training sample once (though not necessarily in a single batch).
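To make the arithmetic concrete (a minimal sketch; the dataset size and batch size mirror the example above, the epoch count is arbitrary):

```python
import math

n_samples = 1_000
batch_size = 150
epochs = 10

steps_per_epoch = math.ceil(n_samples / batch_size)  # ceil(1000 / 150) = 7
total_steps = epochs * steps_per_epoch               # 10 * 7 = 70

print(steps_per_epoch, total_steps)  # 7 70
```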
In the context of Stochastic Gradient Descent (SGD) and its variants (e.g., mini-batch SGD), the optimizer (such as Adam, RMSprop, or SGD with momentum) plays a critical role in determining how the model's weights are updated using the gradients computed during backpropagation. Vanilla SGD applies the same fixed learning rate to all weights, which can lead to slow convergence or instability. Its update rule is

\(w_{t+1} = w_t - \eta \cdot \nabla_w \mathcal{L}(w_t)\)

where \(\eta\) is the learning rate and \(\nabla_w \mathcal{L}\) is the gradient of the loss.
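A minimal illustration of this update rule on a toy scalar weight (values are arbitrary; this is a hand-rolled step, not how Keras training is normally written):

```python
import tensorflow as tf

eta = 0.1             # learning rate
w = tf.Variable(3.0)  # a single trainable weight

with tf.GradientTape() as tape:
    loss = (w - 1.0) ** 2  # toy loss with its minimum at w = 1

grad = tape.gradient(loss, w)  # dL/dw = 2 * (w - 1) = 4.0
w.assign_sub(eta * grad)       # w_{t+1} = w_t - eta * grad  ->  3.0 - 0.1 * 4.0 = 2.6
```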
Adaptive optimizers introduce mechanisms to address SGD's limitations, most notably per-parameter adaptive learning rates (e.g., Adam, RMSprop, Adagrad):

\(w_{t+1} = w_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\),

where \(\hat{m}_t\) is the bias-corrected first-moment (momentum) estimate of the gradients and \(\hat{v}_t\) the second-moment estimate that scales the learning rate adaptively.
The optimizer decides how to use gradients to update weights, introducing mechanisms like momentum or parameter-specific learning rates to make SGD more efficient and stable. Adam is often the default choice due to its adaptivity, but well-tuned vanilla SGD can still match or outperform it in certain scenarios.
The following example trains a dense network on MNIST, prunes the trained weights by magnitude, and then retrains a "winning ticket" model that starts from the original random initialization (in the spirit of the Lottery Ticket Hypothesis):

```python
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential, clone_model
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt
import time

# 1. Load and preprocess data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# 2. Define the original model
def create_model():
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(400, activation='relu'),
        Dense(300, activation='relu'),
        Dense(10, activation='softmax')
    ])
    return model

# 3. Initialize and train the original model
original_model = create_model()
initial_weights = original_model.get_weights()  # Save the initial random weights
original_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
original_model_learning = original_model.fit(x_train, y_train, epochs=4, batch_size=128, validation_split=0.1)
```
```
Epoch 1/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 7s 14ms/step - accuracy: 0.8731 - loss: 0.4482 - val_accuracy: 0.9650 - val_loss: 0.1220
Epoch 2/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 9s 11ms/step - accuracy: 0.9712 - loss: 0.0939 - val_accuracy: 0.9727 - val_loss: 0.0854
Epoch 3/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 6s 14ms/step - accuracy: 0.9828 - loss: 0.0571 - val_accuracy: 0.9785 - val_loss: 0.0738
Epoch 4/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 5s 11ms/step - accuracy: 0.9881 - loss: 0.0382 - val_accuracy: 0.9770 - val_loss: 0.0837
```
```python
# 4. Prune the model (simple magnitude-based pruning)
def prune_weights(model, pruning_percent=0.5):
    weights = model.get_weights()
    new_weights = []
    for w in weights:
        if len(w.shape) > 1:  # only prune dense layer weights
            threshold = np.percentile(np.abs(w), pruning_percent * 100)
            mask = np.abs(w) > threshold
            w = w * mask  # zero out the small weights
        new_weights.append(w)
    return new_weights

pruned_weights = prune_weights(original_model, pruning_percent=0.5)

# 5. Reinitialize model with original random weights and apply the mask (winning ticket)
winning_ticket_model = create_model()
winning_ticket_model.set_weights(initial_weights)
masked_weights = prune_weights(winning_ticket_model, pruning_percent=0.4)
winning_ticket_model.set_weights(masked_weights)

# 6. Retrain the pruned model (winning ticket)
winning_ticket_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
winning_ticket_model_learning = winning_ticket_model.fit(x_train, y_train, epochs=4, batch_size=128, validation_split=0.1)
```
```
Epoch 1/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 7s 14ms/step - accuracy: 0.8732 - loss: 0.4575 - val_accuracy: 0.9715 - val_loss: 0.0985
Epoch 2/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 9s 11ms/step - accuracy: 0.9701 - loss: 0.1006 - val_accuracy: 0.9773 - val_loss: 0.0757
Epoch 3/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 6s 14ms/step - accuracy: 0.9824 - loss: 0.0573 - val_accuracy: 0.9795 - val_loss: 0.0659
Epoch 4/4
422/422 ━━━━━━━━━━━━━━━━━━━━ 5s 11ms/step - accuracy: 0.9882 - loss: 0.0391 - val_accuracy: 0.9810 - val_loss: 0.0705
```
```python
# 7. Evaluate both models
print("Original model performance:")
original_model.evaluate(x_test, y_test)
print("Winning ticket model performance:")
winning_ticket_model.evaluate(x_test, y_test)
```
```
Original model performance:
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9677 - loss: 0.1016
Winning ticket model performance:
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9738 - loss: 0.0804
[0.07047921419143677, 0.977400004863739]
```
```python
# Get the weights of the first hidden layer (Dense(400, activation='relu'))
print(original_model.layers[1].get_weights()[0].shape)
original_model.layers[1].get_weights()[0]  # Shape: (input_dim, output_dim)
```
```
(784, 400)
array([[-0.05667121,  0.02065124, -0.04778236, ...,  0.05302081,
        -0.06146147,  0.00401637],
       [ 0.00472874,  0.00062283, -0.06726848, ..., -0.01869787,
         0.00495344,  0.01626997],
       [-0.06472269,  0.06735108,  0.01358175, ...,  0.00604779,
        -0.04576658,  0.02837193],
       ...,
       [ 0.05708074,  0.05192086,  0.04327519, ..., -0.05372921,
         0.06987718,  0.01477351],
       [ 0.05564093, -0.06744465, -0.01727122, ..., -0.03752626,
         0.01364662,  0.01879197],
       [ 0.04817592,  0.00282612, -0.04398317, ...,  0.04413202,
        -0.01568541, -0.02512715]], dtype=float32)
```
```python
# Get the weights of the layer
weights = winning_ticket_model.layers[1].get_weights()[0]  # Shape: (input_dim, output_dim)

# Create a boolean mask for weights == 0
zero_mask = (np.abs(weights) == 0)

# Optional: Count the percentage of zeros
sparsity = np.mean(zero_mask) * 100
print(f"Sparsity: {sparsity:.2f}% of weights are exactly zero")

# Visualize the mask (e.g., as a heatmap)
import matplotlib.pyplot as plt
plt.imshow(zero_mask, cmap='gray', interpolation='none')
plt.title("Mask of Zero Weights (White = Zero)")
plt.colorbar()
plt.show()
```

```
Sparsity: 4.46% of weights are exactly zero
```
GANs consist of two neural networks in competition: a generator and a discriminator. This setup forms a zero-sum game: the generator tries to produce samples that fool the discriminator, while the discriminator tries to distinguish real samples from generated ones.

The architecture of a GAN consists of two main components:
- Generator: maps random noise to synthetic samples.
- Discriminator: classifies its input as real or generated.

The training process is adversarial: the two networks are trained in alternation, the discriminator on batches of real and generated samples, the generator on how well its samples fool the discriminator.

GANs use two separate loss functions: one for the discriminator (rewarding correct real/fake classification) and one for the generator (rewarding fooled discriminator outputs), as sketched below.
Both networks are trained using backpropagation.
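A condensed sketch of this setup in TensorFlow (layer sizes, latent dimension, and optimizer settings are my own choices, loosely following the standard DCGAN-style training loop):

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 100  # size of the random noise vector (chosen for this sketch)

# Generator: maps random noise to a fake 28x28 image
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(LATENT_DIM,)),
    layers.Dense(256, activation='relu'),
    layers.Dense(28 * 28, activation='sigmoid'),
    layers.Reshape((28, 28)),
])

# Discriminator: outputs a logit, high for "real", low for "generated"
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(1),
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # Reward the discriminator for labelling real samples 1 and generated samples 0
    return bce(tf.ones_like(real_logits), real_logits) + \
           bce(tf.zeros_like(fake_logits), fake_logits)

def generator_loss(fake_logits):
    # Reward the generator when the discriminator labels its samples as real (1)
    return bce(tf.ones_like(fake_logits), fake_logits)

gen_opt = tf.keras.optimizers.Adam(1e-4)
disc_opt = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_images):
    noise = tf.random.normal([tf.shape(real_images)[0], LATENT_DIM])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        g_loss = generator_loss(fake_logits)
        d_loss = discriminator_loss(real_logits, fake_logits)
    gen_opt.apply_gradients(zip(gen_tape.gradient(g_loss, generator.trainable_variables),
                                generator.trainable_variables))
    disc_opt.apply_gradients(zip(disc_tape.gradient(d_loss, discriminator.trainable_variables),
                                 discriminator.trainable_variables))
    return g_loss, d_loss
```

Training then just loops `train_step` over batches of real images (e.g., MNIST digits scaled to [0, 1]).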
⚠️ Note: These problems are still active areas of research. No universal solution exists, and training GANs often requires careful tuning and experimentation.