Learning Gradients of Convex Functions with Monotone Gradient Networks

Feb 20, 2025 · Thomas Gravier, Emilio Picard · 3 min read

This project was part of the Generative Modeling course at ENS Paris-Saclay – Master MVA, focusing on the intersection of deep learning, convex analysis, and optimal transport.
It aimed to explore Monotone Gradient Networks (MGNs) — architectures that learn gradients of convex functions to model structured transport maps between probability distributions.


Objective

To understand, rederive, and implement recent architectures for learning gradients of convex functions, and to extend them to generative modeling tasks.
We analyzed and compared two recent architectures:

  • Cascaded Monotone Gradient Network (C-MGN)
  • Modular Monotone Gradient Network (M-MGN)

These models were originally proposed to learn monotone operators corresponding to the gradients of convex potentials, offering a principled connection between convex optimization and generative transport.


Theoretical Framework

The work revisits fundamental results from optimal transport theory, in particular Brenier’s theorem: for quadratic cost, the optimal transport map from an absolutely continuous source distribution is the gradient of a convex function.
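For reference, the statement can be written compactly as follows (quadratic cost, absolutely continuous source μ):

```latex
% Brenier (1991): the optimal map for quadratic cost is a convex gradient.
T^{\star} = \nabla\varphi, \qquad \varphi \ \text{convex}, \qquad (\nabla\varphi)_{\#}\mu = \nu,
\qquad T^{\star} \in \arg\min_{T_{\#}\mu = \nu} \int \lVert x - T(x)\rVert^{2}\, d\mu(x).
```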

We formally rederived:

  • The PSD constraints on network Jacobians to ensure monotonicity
  • Theoretical proofs that C-MGN and M-MGN satisfy convex-gradient conditions
  • The connection between Sinkhorn distances, Wasserstein metrics, and entropy-regularized OT (a minimal Sinkhorn iteration is sketched after this list)
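To make the entropic-OT connection concrete, here is a minimal Sinkhorn iteration between two uniform point clouds in PyTorch. It is a sketch: the kernel form underflows for small `eps` (log-domain updates fix this), and `eps`, `n_iters`, and the squared-Euclidean cost are illustrative choices.

```python
import torch

def sinkhorn_cost(x, y, eps=0.05, n_iters=200):
    """Entropy-regularized OT cost between uniform point clouds (plain Sinkhorn)."""
    C = torch.cdist(x, y) ** 2                 # squared-Euclidean cost matrix
    K = torch.exp(-C / eps)                    # Gibbs kernel
    n, m = x.shape[0], y.shape[0]
    a = torch.full((n,), 1.0 / n)              # uniform source weights
    b = torch.full((m,), 1.0 / m)              # uniform target weights
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):                   # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]            # entropic optimal coupling
    return (P * C).sum()                       # transport cost under P
```

As eps decreases toward zero, this cost approaches the squared 2-Wasserstein distance between the empirical distributions.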

Mathematical Foundation

The key insight is that the optimal transport map T* from distribution μ to ν (under quadratic cost) can be written as the gradient of a convex potential function φ. For MGNs, we ensure convexity by constraining the network’s Jacobian to be symmetric positive semi-definite (a symmetric PSD Jacobian is precisely the Hessian of a convex potential), which guarantees that the learned transport map is monotone and corresponds to an optimal coupling.
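This condition is easy to check numerically. The snippet below uses a toy potential φ(x) = ‖x‖²/2 + logsumexp(x) (a hypothetical example, not one of the project’s test functions), whose gradient is x + softmax(x), and verifies that the Jacobian of that gradient map is symmetric and PSD:

```python
import torch
from torch.autograd.functional import jacobian

# Gradient of the convex potential phi(x) = ||x||^2 / 2 + logsumexp(x):
# grad phi(x) = x + softmax(x), so its Jacobian is the Hessian of phi.
def grad_phi(x):
    return x + torch.softmax(x, dim=-1)

x0 = torch.randn(5)
J = jacobian(grad_phi, x0)                     # 5x5 Jacobian at x0
print((J - J.T).abs().max())                   # symmetry error, ~0
print(torch.linalg.eigvalsh(0.5 * (J + J.T)))  # eigenvalues, all >= 0
```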


Experiments

1. Gradient Field Approximation

We validated the architectures on synthetic convex functions such as f(x) = x₁⁴ + x₂²/2 + x₁x₂/2.
Both models achieved MAE ≈ 10⁻⁵ on gradient field prediction.
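This experiment is plain supervised regression against the analytic gradient ∇f(x) = (4x₁³ + x₂/2, x₂ + x₁/2). A minimal sketch, with a small MLP standing in for the MGN (network size, sampling range, and optimizer settings are illustrative):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),   # stand-in for the MGN
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def true_grad(x):          # gradient of f(x) = x1^4 + x2^2/2 + x1*x2/2
    x1, x2 = x[:, :1], x[:, 1:]
    return torch.cat([4 * x1 ** 3 + x2 / 2, x2 + x1 / 2], dim=1)

for step in range(2000):
    x = 2 * torch.rand(256, 2) - 1                 # samples in [-1, 1]^2
    loss = (net(x) - true_grad(x)).abs().mean()    # MAE on the gradient field
    opt.zero_grad(); loss.backward(); opt.step()
```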

2. Optimal Transport Between Distributions

We trained both models to learn transport maps between:

  • Gaussian → Gaussian: Standard multivariate distributions
  • Gaussian → Banana: Non-linear target distributions

Using a Sinkhorn loss, the results were:

Model   Dataset    Wasserstein Distance ↓
-----   --------   ----------------------
C-MGN   Gaussian   0.12
M-MGN   Gaussian   0.09
C-MGN   Banana     0.19
M-MGN   Banana     0.17
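The post does not say which Sinkhorn implementation was used; one standard choice is the geomloss package, whose SamplesLoss(loss="sinkhorn") is a differentiable, debiased Sinkhorn divergence. A minimal Gaussian-to-Gaussian training sketch, with a plain MLP as a stand-in map and illustrative hyperparameters:

```python
import torch
import torch.nn as nn
from geomloss import SamplesLoss  # assumed dependency: pip install geomloss

T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))  # stand-in map
sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)  # debiased divergence
opt = torch.optim.Adam(T.parameters(), lr=1e-3)

target = torch.distributions.MultivariateNormal(torch.tensor([3.0, 0.0]),
                                                torch.eye(2))
for step in range(1000):
    x = torch.randn(512, 2)         # source: standard Gaussian samples
    y = target.sample((512,))       # target samples
    loss = sinkhorn(T(x), y)        # push T(x) toward the target distribution
    opt.zero_grad(); loss.backward(); opt.step()
```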

3. High-Dimensional Generative Modeling

  • MNIST: mapping Gaussian noise → handwritten digits (transport-based generation).
  • CIFAR-10: grayscale → color image translation (optimal coupling for colorization).
  • Domain Adaptation: style transfer from day → sunset scenes via learned color distributions.

C-MGN produced results comparable to small VAEs, while maintaining theoretical guarantees of convexity and monotonicity.


Technical Implementation

Architecture Details

  • C-MGN (Cascaded): Sequential convex layers with PSD constraints
  • M-MGN (Modular): Parallel branches merged via a convex combination (see the sketch after this list)
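To illustrate the PSD-by-construction idea (a sketch under our own parameterization, not necessarily the paper’s exact layers): a linear term AᵀAx plus terms Vᵀσ(Vx + c) with a nondecreasing activation σ has Jacobian AᵀA + Σ Vᵀ diag(σ′) V, which is symmetric PSD, so the network is the gradient of some convex potential.

```python
import torch
import torch.nn as nn

class MonotoneGradientModule(nn.Module):
    """Monotone map with a symmetric PSD Jacobian by construction
    (illustrative parameterization, not the exact C-MGN/M-MGN layers)."""
    def __init__(self, dim, hidden, k=4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.b = nn.Parameter(torch.zeros(dim))
        self.V = nn.ParameterList(
            [nn.Parameter(torch.randn(hidden, dim) / dim ** 0.5) for _ in range(k)])
        self.c = nn.ParameterList(
            [nn.Parameter(torch.zeros(hidden)) for _ in range(k)])

    def forward(self, x):                           # x: (batch, dim)
        out = x @ (self.A.T @ self.A) + self.b      # PSD linear term A^T A x + b
        for V, c in zip(self.V, self.c):
            out = out + torch.tanh(x @ V.T + c) @ V # V^T tanh(Vx + c), tanh' >= 0
        return out
```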

Key Components

  1. Convexity Constraints: Enforced via spectral normalization of Jacobians
  2. Monotonicity: Guaranteed through positive semi-definite Hessians (see the equivalence below)
  3. Transport Loss: Combination of Wasserstein distance and Sinkhorn regularization
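All three components rest on the same standard equivalence for a twice-differentiable f:

```latex
f \ \text{convex}
\;\Longleftrightarrow\;
\big(\nabla f(x) - \nabla f(y)\big)^{\top}(x - y) \ \ge\ 0 \quad \forall x, y
\;\Longleftrightarrow\;
\nabla^{2} f(x) \succeq 0 \quad \forall x.
```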

Key Findings

  • MGN models enable stable generative training using transport-based objectives.
  • Convex constraints enhance interpretability and reduce mode collapse in learned mappings.
  • Successful extension of MGNs to high-dimensional visual data (MNIST, CIFAR-10).
  • Demonstrated structured image-to-image translation using optimal transport principles.

Perspectives

Future work includes integrating convolutional operators into MGNs to better capture local spatial correlations while preserving convexity.
This approach could serve as a bridge between optimal transport, diffusion models, and energy-based generative modeling.


References

  • Chaudhari, S., Pranav, S., & Moura, J. M. F. (2023). Learning Gradients of Convex Functions with Monotone Gradient Networks. ICASSP 2023.
  • Peyré, G., & Cuturi, M. (2019). Computational Optimal Transport. Foundations and Trends in Machine Learning.
  • Brenier, Y. (1991). Polar Factorization and Monotone Rearrangement of Vector-Valued Functions. Communications on Pure and Applied Mathematics.
  • Cuturi, M. (2013). Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. Advances in Neural Information Processing Systems.
  • Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser.