Hello readers!

This post is structured as a journal entry and will be updated as I progress through my Master's Project.

Description

  • Machine learning tasks can be implemented to protect the privacy of the training data using Differentially Private Stochastic Gradient Descent (DPSGD).
  • DPSGD incurs a performance penalty because of the additional per-sample gradient clipping and noise addition steps.
  • Several optimizations have been proposed to reduce this performance penalty, including Opacus, FastGradClip, GhostClip, and Book-Keeping.

Goals

My task is to use the open-source code published for the proposed optimizations and to evaluate the performance and privacy of each of them.

Machine learning task

I will evaluate the performance and privacy of segmenting retinal images with the convolutional neural network U-Net.

  • An optical coherence tomography (OCT) scan is a safe procedure that uses light reflected from a low-power laser to obtain images of the retina — the principle is similar to ultrasound but uses light instead of sound.
  • Retinal layer segmentation in OCT images is essential for detecting disease and assessing its prognosis. These images are used for early detection of eye-related diseases such as diabetic retinopathy and age-related macular degeneration.

Privacy, but why is it important here?

Here's why privacy is paramount in this context:

  1. Sensitive Personal Information

    • Retinal images, like fingerprints, are unique to individuals. They can contain sensitive personal information, potentially revealing medical conditions related to the eye and other health issues. This uniqueness makes it essential to handle such data carefully to protect individuals' privacy.
  2. Secondary Usage of Data

    • Retinal images might be used for purposes beyond the initial medical diagnosis, such as research or training of AI models. Ensuring Privacy in these secondary uses is crucial to prevent unintended exposure of personal information.
  3. Data Breach Risks

    • As with any digital data, retinal images stored or transmitted electronically are susceptible to unauthorized access through data breaches. If these images are not adequately protected, they could be misused for identity theft, unauthorized tracking, or discrimination.

As I progress through the literature review and implementation stages, I will keep adding new sections to this post.

Efficient Reading of Papers in Science and Technology

  • Preparation
  • What to read?
  • Read for breadth
  • Read in-depth
  • Take notes

Heilmeier's Catechism

  • Articulating objectives in simple words is crucial.
  • Ask a lot of "whys" while reading and implementing the process.
  • Be clear about what challenge is being solved.

Convolution

  • A mathematical operation on two functions that produces a third function.
  • The term convolution refers to both the result function and the computing process.
  • In image processing, convolution is used to manipulate images: blurring, sharpening, edge detection, and noise reduction.
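
To make this concrete, here is a minimal sketch of a 2D convolution on a tiny image using NumPy and SciPy. The image values and the Sobel-like kernel are arbitrary choices of mine for illustration, not anything taken from the sources above.

  import numpy as np
  from scipy.signal import convolve2d

  # A small grayscale "image": bright on the left, dark on the right (made-up values).
  image = np.array([
      [10, 10, 10, 0, 0, 0],
      [10, 10, 10, 0, 0, 0],
      [10, 10, 10, 0, 0, 0],
      [10, 10, 10, 0, 0, 0],
  ], dtype=float)

  # A Sobel-like kernel that responds to left-to-right intensity changes (vertical edges).
  kernel = np.array([
      [1, 0, -1],
      [2, 0, -2],
      [1, 0, -1],
  ], dtype=float)

  # Convolving the two produces a third array (the feature map), with large absolute
  # values where the image intensity changes from bright to dark.
  feature_map = convolve2d(image, kernel, mode="valid")
  print(feature_map)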

Neural Network

  • It consists of layers of nodes - an input layer, one or more hidden layers and an output layer.
  • Each node connects to others and has its associated weight and threshold.
  • A node only passes data to the next layer when its output exceeds the specified threshold value.
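
As a toy illustration of the node description above, here is a single node computed by hand in Python; the inputs, weights, bias, and threshold are made-up numbers, and real networks typically use smooth activation functions rather than a hard threshold.

  import numpy as np

  # Made-up inputs, weights, and bias for a single node.
  inputs = np.array([0.5, 0.3, 0.9])
  weights = np.array([0.4, -0.2, 0.7])
  bias = 0.1
  threshold = 0.5

  # Weighted sum of the inputs plus the bias.
  activation = float(np.dot(inputs, weights) + bias)  # 0.87

  # The node only passes data to the next layer if its output exceeds the threshold.
  output = activation if activation > threshold else 0.0
  print(activation, output)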

Convolutional Neural Network (CNN)

CNNs show superior performance with image, speech, or audio signal inputs.

Three main types of layers:

Convolutional layer

  • Core of CNN: Convolution operations using filters to produce feature maps.
  • Key Components:
    • Input data
    • Filter/kernel
    • Feature map
  • Convolution involves a filter moving across the image with fixed weights ("parameter sharing"), adjusted during training via backpropagation and gradient descent.
  • Three critical hyperparameters affecting the output volume must be set before training begins:
    • Number of Filters: Determines output depth; more filters result in more feature maps.
    • Stride: Distance the filter moves over the input, affecting output size; larger strides reduce output dimensions.
    • Padding: Addresses filter alignment with input; types include valid (no padding), same (maintains input size), and full (increases output size).
  • Post-convolution step: Application of a Rectified Linear Unit (ReLU) to add nonlinearity to the model.
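
The hyperparameters above map directly onto PyTorch's Conv2d. The sketch below uses arbitrary values for the number of filters, stride, and padding purely to show how they affect the output volume; it is not the configuration used in my project.

  import torch
  import torch.nn as nn

  # One grayscale image, 1 channel, 28x28 (size chosen only for illustration).
  x = torch.randn(1, 1, 28, 28)

  # 16 filters -> output depth 16; 3x3 kernel; stride 1; "same"-style padding of 1.
  conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1)
  relu = nn.ReLU()  # post-convolution nonlinearity

  feature_maps = relu(conv(x))
  print(feature_maps.shape)        # torch.Size([1, 16, 28, 28])

  # A larger stride reduces the spatial size of the output.
  conv_strided = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1)
  print(conv_strided(x).shape)     # torch.Size([1, 16, 14, 14])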

Pooling layer

  • Pooling Layers: Serve as a downsampling operation, reducing input dimensionality and the number of parameters.
  • Operation: Pooling applies an aggregation function via a filter across the input, populating the output array.
  • Types of Pooling:
    • Max Pooling: Selects the maximum value within the filter's receptive field for the output, commonly preferred for its performance.
    • Average Pooling: Calculates the average value within the receptive field for the output.
  • Benefits: Reduces model complexity, enhances efficiency, and helps prevent overfitting.
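
Here is a short sketch of max and average pooling in PyTorch on a tiny made-up tensor, just to show the downsampling effect described above.

  import torch
  import torch.nn as nn

  # Input of shape (batch, channels, height, width) with made-up values.
  x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                      [ 5.,  6.,  7.,  8.],
                      [ 9., 10., 11., 12.],
                      [13., 14., 15., 16.]]]])

  max_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keeps the maximum of each 2x2 window
  avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)  # averages each 2x2 window

  print(max_pool(x))  # [[ 6.,  8.], [14., 16.]] -- height and width are halved
  print(avg_pool(x))  # [[ 3.5,  5.5], [11.5, 13.5]]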

Fully Connected (FC) Layer

  • Output Layer Nodes: Unlike in partially connected layers, each node connects directly to every node in the preceding layer.
  • FC Layers: Often use softmax activation function for input classification, yielding probabilities between 0 and 1.
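
A minimal sketch of a fully connected layer followed by softmax in PyTorch; the sizes (64 flattened features, 10 classes) are arbitrary placeholders.

  import torch
  import torch.nn as nn

  # Flattened feature vectors from the previous layers (batch of 2, 64 features, random).
  features = torch.randn(2, 64)

  fc = nn.Linear(in_features=64, out_features=10)  # every input connects to every output node
  softmax = nn.Softmax(dim=1)

  probs = softmax(fc(features))
  print(probs.shape)       # torch.Size([2, 10])
  print(probs.sum(dim=1))  # each row sums to 1: class probabilities between 0 and 1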

Gradient Descent

  • An optimization algorithm for training machine learning models and neural networks to minimize the cost function—the error between predicted and actual outcomes.
  • Key Components:
    • Direction and Learning Rate: The gradient's partial derivatives give the direction of steepest descent toward a local or global minimum, while the learning rate determines the step size taken in that direction; a balance is needed to avoid overshooting or slow convergence.
    • Cost Function: Represents the average error across the entire training set and evaluates the error between actual and predicted values, guiding parameter adjustments to minimize error and reach the minimum efficiently.
  • Process: Iteratively moves toward steepest descent, adjusting parameters until the cost function is minimized, signifying the model has effectively learned.
  • Stochastic Gradient Descent (SGD)
    • Processes training examples one at a time within an epoch, updating the parameters after each example, which enhances memory efficiency.
    • Frequent updates provide detailed adjustments and can speed up the training process, with the potential to escape local minima and find the global minimum.
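
To tie this together, here is a small hand-rolled SGD loop fitting a one-parameter linear model on made-up data; the data, learning rate, and epoch count are arbitrary and only meant to illustrate the update rule w <- w - lr * gradient.

  import numpy as np

  # Made-up data following y = 3x plus noise; the model is y_hat = w * x.
  rng = np.random.default_rng(0)
  x = rng.uniform(-1, 1, size=100)
  y = 3.0 * x + 0.1 * rng.normal(size=100)

  w = 0.0    # initial parameter
  lr = 0.1   # learning rate (step size)

  for epoch in range(5):
      for xi, yi in zip(x, y):           # SGD: one training example at a time
          grad = 2 * (w * xi - yi) * xi  # gradient of the squared error for this example
          w -= lr * grad                 # step in the direction of steepest descent
      print(f"epoch {epoch}: w = {w:.3f}")  # w approaches 3.0 as the cost shrinks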

Differential Privacy

  • Differential Privacy ensures that little additional knowledge about any individual can be gained from a data analysis: the result stays essentially the same whether or not that individual's data is included.
  • It protects against adversaries gaining significant insights about individuals, even with access to the database.
  • The approach involves introducing randomness into the data processing to safeguard privacy.
  • Ensures the integrity of data analysis without compromising individual privacy.
  • Protects against privacy breaches from current and future auxiliary information sources.
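
For reference, the standard definition that formalizes this intuition: a randomized mechanism M is (ε, δ)-differentially private if, for all datasets D and D' differing in a single individual's record and for every set of outputs S,

  Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ

Smaller ε and δ mean the output distribution barely changes when any one person's data is added or removed, which is exactly the guarantee described in the bullets above.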

Differentially Private Stochastic Gradient Descent

Differentially Private Stochastic Gradient Descent (DP-SGD) integrates differential privacy into SGD by adding noise to gradients during training to meet privacy standards. Here's a breakdown of the process:

  • Begin by selecting a random mini-batch from the dataset.
  • Calculate the gradient of the loss function for this mini-batch.
  • Clip the gradients to a predefined maximum norm to control their influence.
  • Introduce noise (typically Gaussian or Laplacian) to these clipped gradients to obscure the impact of any single data point.
  • Update model parameters using these noise-added gradients, considering the learning rate.

The scale of the added noise is determined by privacy parameters and the dataset's sensitivity, which is the potential change in output from altering a single data point. Balancing the clipping norm and noise scale is critical to ensure the model remains useful and privacy-compliant, preventing underfitting or privacy breaches.
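
Below is a minimal sketch of one DP-SGD step in plain PyTorch, following the bullet list above. The clipping norm, noise multiplier, and the naive per-example loop are illustrative choices of my own (libraries such as Opacus vectorize this), not the exact implementation evaluated in this project.

  import torch

  def dp_sgd_step(model, loss_fn, xb, yb, lr=0.05, max_grad_norm=1.0, noise_multiplier=1.0):
      """One DP-SGD update on a mini-batch (naive per-example loop for clarity)."""
      params = [p for p in model.parameters() if p.requires_grad]
      summed_grads = [torch.zeros_like(p) for p in params]

      for xi, yi in zip(xb, yb):  # per-sample gradients
          model.zero_grad()
          loss = loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0))
          grads = torch.autograd.grad(loss, params)

          # Clip each per-sample gradient to norm at most max_grad_norm.
          total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
          clip_coef = torch.clamp(max_grad_norm / (total_norm + 1e-6), max=1.0)
          for acc, g in zip(summed_grads, grads):
              acc += g * clip_coef

      with torch.no_grad():
          for p, g in zip(params, summed_grads):
              # Add Gaussian noise scaled by the clipping norm, then average and step.
              noise = torch.normal(0.0, noise_multiplier * max_grad_norm, size=g.shape)
              p -= lr * (g + noise) / len(xb)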

U-Net Architecture

  • U-Net Architecture: Designed for semantic segmentation, featuring contracting and expansive paths.
  • Contracting Path
    • Mimics a standard convolutional network structure with repeated 3x3 convolutions, ReLU activation, and 2x2 max pooling for downsampling, doubling feature channels at each step.
  • Expansive Path
    • Involves upsampling of feature maps, 2x2 "up-convolutions" to reduce feature channels, concatenation with matching feature maps from the contracting path, and two 3x3 convolutions and ReLU activation.
  • Cropping: Necessary to compensate for border pixel loss during convolution.
  • Output Layer: Utilizes a 1x1 convolution to map feature vectors to class labels.
  • Total Layers: The network comprises 23 convolutional layers.
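
As a rough sketch of the two paths described above, here is one contracting step and one expansive step in PyTorch. The channel counts and image size are examples, and unlike the original paper's unpadded convolutions this sketch pads the 3x3 convolutions, so no cropping is needed.

  import torch
  import torch.nn as nn

  def double_conv(in_ch, out_ch):
      # Two 3x3 convolutions, each followed by ReLU (padded here, so sizes are preserved).
      return nn.Sequential(
          nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
          nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
      )

  x = torch.randn(1, 1, 128, 128)      # e.g. one grayscale OCT slice (size is illustrative)

  # Contracting path: double conv, then 2x2 max pooling; feature channels double per step.
  enc1 = double_conv(1, 64)(x)         # -> (1, 64, 128, 128)
  down = nn.MaxPool2d(2)(enc1)         # -> (1, 64, 64, 64)
  enc2 = double_conv(64, 128)(down)    # -> (1, 128, 64, 64)

  # Expansive path: 2x2 up-convolution halves the channels, then the matching feature
  # map from the contracting path is concatenated (the skip connection).
  up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)(enc2)  # -> (1, 64, 128, 128)
  merged = torch.cat([enc1, up], dim=1)                            # -> (1, 128, 128, 128)
  dec = double_conv(128, 64)(merged)                               # -> (1, 64, 128, 128)

  # Output layer: a 1x1 convolution maps the feature vectors to class labels (2 classes here).
  logits = nn.Conv2d(64, 2, kernel_size=1)(dec)
  print(logits.shape)                  # torch.Size([1, 2, 128, 128])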

Differentially Private Optimization on Large Model at Small Cost

  • Bottlenecks/Existing Challenges

    • Per-sample gradient clipping - limits sensitivity and aims to enhance privacy
    • Per-sample gradient norms
      • Computationally Intensive
      • Memory Intensive
  • Existing Solutions

    • Opacus, TensorFlow Privacy, FastGradClip - slow down training because of multiple rounds of backpropagation
      • Demand a lot of memory
    • Ghost clipping
      • Computes per-sample gradient norms without instantiating the per-sample gradients
      • Two rounds of backpropagation, double training time compared to non-DP training.
  • Proposed solution

    • Book Keeping Technique
      • Single round of Backpropagation
    • Hybrid solution
  • BK Base

    • Ghost norm + book-keeping trick + ghost differentiation.
  • Ghost Norm

    • The norm of a per-sample gradient (an outer product) can be calculated from the norms of the vectors involved, without forming the outer product itself (see the sketch at the end of this section)
    • Full gradient context
      • Computed during backpropagation
      • Outer product of the input activations and the gradient of the loss
    • Impact
      • Reduced memory footprint
      • Efficiency when the number of parameters is large
  • Book Keeping Trick

    • Context of Ghost Clipping
      • Step 1: Compute gradient norms
      • Step 2: Compute gradient for parameters updates
    • With the help of the BK trick, the second backpropagation of step 2 is no longer needed
      • The output gradients from step 1, which would otherwise be discarded after the parameter update, are kept ("book-kept") and reused, resulting in reduced computing cost.
      • Good for large models.
  • Ghost Differential

    • Optimizes step 1.
    • Step 1 consists of two parts
      • I - Compute the gradient of the loss with respect to the activations
      • II - Compute the parameter gradients
    • Part II is skipped to avoid its computational cost.
  • Conclusion

    • BK Algo = Ghost Norm + Book Keeping Trick + Ghost Differential
    • Significant Efficiency Improvement
      • Space Complexity ~= Non DP Training
      • Time Complexity ~= Non DP Training
    • Hybrid BK - intended for high-dimensional data; it chooses, layer by layer, between book-keeping and the previously available optimization approaches.
    • Extension to parameter-efficient fine-tuning
      • Also works when only a small subset of model parameters is updated during tuning.
    • Limitations
      • Applicable only to the weights of generalized linear layers, not to biases; in practice this is a minor limitation.
      • A performance gap remains due to the limitations of PyTorch hooks
        • What? - Model introspection and modification are needed during training.
        • How? - Custom CUDA kernels or symbolic programming
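
As a sanity check on the ghost norm idea (the sketch referenced in the Ghost Norm bullets above), the toy code below compares, for a plain linear layer, the per-sample gradient norms obtained from the activation and output-gradient norms against the norms of the explicitly formed outer products. This is my own verification of the identity, not code from the paper.

  import torch

  torch.manual_seed(0)
  B, d_in, d_out = 8, 5, 3                   # toy sizes

  a = torch.randn(B, d_in)                   # layer input activations (one row per sample)
  g = torch.randn(B, d_out)                  # gradient of the loss w.r.t. the layer output

  # Explicit per-sample weight gradients: the outer product g_i a_i^T for each sample i.
  per_sample_grads = torch.einsum('bo,bi->boi', g, a)   # shape (B, d_out, d_in)
  explicit_norms = per_sample_grads.flatten(1).norm(dim=1)

  # Ghost norm: ||g_i a_i^T||_F = ||g_i|| * ||a_i||, so the outer product itself
  # never needs to be materialized -- only two vector norms per sample.
  ghost_norms = g.norm(dim=1) * a.norm(dim=1)

  print(torch.allclose(explicit_norms, ghost_norms))    # True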

Opacus: User-Friendly Differential Privacy Library in PyTorch

This publication was more like a report than a paper when I read through it.

  • Bottlenecks/Existing Challenges

    • The cost of making a training pipeline private is very high.
    • Training pipelines that ensure privacy are slow.
    • Existing methods are not flexible enough to use with a wide variety of layers.
  • Proposed Solution

    • Simplicity: Enables machine learning practitioners to make a training pipeline private by adding as little as two lines to their code (a minimal usage sketch follows at the end of this section).
    • Flexibility: Supports a wide variety of layers, including multi-head attention, convolution, LSTM, GRU (and generic RNN), and embedding, right out of the box and provides the means for supporting other user-defined layers.
    • Speed: Opacus computes batched per-sample gradients, providing higher efficiency compared to the traditional “micro batch” approach. It implements performance-improving vectorized computation.
  • Features of Opacus

    Opacus provides a powerful framework for implementing DP in machine learning, with an emphasis on flexibility, efficiency, and compatibility with existing PyTorch models and workflows.

    1. Privacy Accounting

      • Users can set a target (ε,δ) privacy budget, and Opacus will automatically adjust the noise level (σ) to meet this budget. It also supports custom privacy accountants.
    2. Model Validation

      • Before training, Opacus verifies that the model's architecture is suitable for DP-SGD. It disallows layers that mix information across batch samples, such as certain configurations of BatchNorm and GroupNorm, since they complicate the computation of per-sample gradients.
    3. Poisson Sampling

      • Supports uniform (Poisson) sampling of data batches, where each data point has an independent probability of being included in a batch. This sampling method aligns with some DP-SGD analyses.
    4. Efficiency

      • Opacus is optimized for hardware accelerators and supports distributed training through PyTorch's DistributedDataParallel.
    5. Virtual Steps

      • To manage the trade-off between speed and memory usage when computing per-sample gradients, Opacus allows for the separation of physical batch sizes (limited by memory) and logical batch sizes (determined by model convergence and privacy considerations).
    6. Predefined and Custom Layers

      • Opacus includes predefined layers (e.g., convolution, LSTM, GRU, embedding) and allows users to add custom layers by providing a method to calculate per-sample gradients.
    7. Secure random number generation

      • Offers an option for cryptographically secure pseudo-random number generation (CSPRNG) for noise addition and batch composition, enhancing security for sensitive applications.
    8. Noise scheduler and Variable batch size

      • Introduces a noise scheduler that adjusts the noise multiplier during training according to predefined schedules (e.g., exponential, step). It also accommodates variable batch sizes throughout the training process.
    9. Modular

      • Integrates seamlessly with PyTorch Lightning, simplifying the coding of complex networks and reducing boilerplate code.
  • Benchmarking

    • The study benchmarks Opacus against various frameworks by comparing runtimes in end-to-end model training tasks.
    • It evaluates Opacus on tasks like MNIST and CIFAR-10 image classification, and IMDb sentiment analysis, against frameworks like JAX, TensorFlow Privacy, BackPACK, and PyVacy.
    • Opacus is shown to validate models for DP compatibility, support Poisson sampling, offer efficient hardware utilization, and allow virtual steps to manage memory usage.
  • Results

    • JAX (DP) generally provides the lowest runtime, particularly on MNIST and IMDb, and outperforms even non-DP PyTorch at smaller batch sizes. However, Opacus becomes competitive at larger batch sizes.
    • Opacus incurs a runtime overhead of 2x to 2.9x on most tasks compared to non-DP PyTorch, but this increases significantly (up to 30x) for models like LSTMs.
    • Memory overhead with Opacus grows with batch size, particularly for layers like linear and embedding, due to the necessity of computing per-sample gradients in DP-SGD.
  • Conclusion

    • Opacus enables DP training with manageable overheads for most tasks and layers; however, challenges remain for complex models and larger batch sizes, mainly due to the inherent requirements of DP-SGD.
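
To illustrate the "two lines of code" claim from the Simplicity bullet above, here is a minimal training setup wrapped with Opacus. It assumes the Opacus 1.x API (PrivacyEngine.make_private and get_epsilon) and uses a made-up toy model and random data rather than the actual project pipeline.

  import torch
  import torch.nn as nn
  from torch.utils.data import DataLoader, TensorDataset
  from opacus import PrivacyEngine

  # Toy model and random data, purely for illustration.
  model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
  optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
  criterion = nn.CrossEntropyLoss()
  dataset = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
  loader = DataLoader(dataset, batch_size=32)

  # The Opacus part: wrap model, optimizer, and data loader for DP-SGD training.
  privacy_engine = PrivacyEngine()
  model, optimizer, loader = privacy_engine.make_private(
      module=model,
      optimizer=optimizer,
      data_loader=loader,
      noise_multiplier=1.0,   # scale of the Gaussian noise (sigma)
      max_grad_norm=1.0,      # per-sample gradient clipping norm
  )

  # The training loop itself is unchanged from ordinary PyTorch.
  for epoch in range(2):
      for xb, yb in loader:
          optimizer.zero_grad()
          loss = criterion(model(xb), yb)
          loss.backward()
          optimizer.step()

  # Privacy accounting: epsilon spent so far for a chosen delta.
  print(privacy_engine.get_epsilon(delta=1e-5))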

References:

  1. https://digitalprivacy.ieee.org/publications/topics/what-is-differential-privacy
  2. https://www.ibm.com/topics/gradient-descent
  3. https://betterexplained.com/articles/intuitive-convolution/
  4. https://www.ibm.com/topics/neural-networks
  5. https://paperswithcode.com/paper/u-net-convolutional-networks-for-biomedical