# Demo 0: Example and usage
In order to keep things simple, the following rules have been followed during development:

- `deel-lip` follows the `keras` package structure.
- All elements (layers, activations, initializers, ...) are compatible with standard `keras` elements.
- When a k-Lipschitz layer overrides a standard `keras` layer, it uses the same interface and the same parameters. The only difference is a new parameter to control the Lipschitz constant of the layer (see the sketch below).
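For instance, here is a minimal sketch (assuming `deel-lip` and TensorFlow are installed) of how a 1-Lipschitz layer mirrors its `keras` counterpart; the only new parameter is `k_coef_lip`:

```python
from tensorflow.keras.layers import Dense

from deel.lip.layers import SpectralDense

# Same constructor arguments as the standard keras layer...
dense = Dense(64, use_bias=True)
# ...plus k_coef_lip, which controls the Lipschitz constant of the layer.
lip_dense = SpectralDense(64, use_bias=True, k_coef_lip=1.0)
```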
## Which layers are safe to use?

The following table indicates which layers are safe to use in a Lipschitz network, and which are not.
| layer | 1-lip? | deel-lip equivalent | comments |
| --- | --- | --- | --- |
| `Dense` | no | `SpectralDense`, `FrobeniusDense` | `SpectralDense` and `FrobeniusDense` are similar when there is a single output. |
| `Conv2D` | no | `SpectralConv2D`, `FrobeniusConv2D` | `SpectralConv2D` also implements Björck normalization. |
| `MaxPooling`, `GlobalMaxPooling` | yes | n/a | |
| `AveragePooling2D`, `GlobalAveragePooling2D` | no | `ScaledAveragePooling2D`, `ScaledGlobalAveragePooling2D` | The Lipschitz constant is bounded by `sqrt(pool_h * pool_w)`. |
| `Flatten` | yes | n/a | |
| `Dropout` | no | None | The Lipschitz constant is bounded by the dropout factor. |
| `BatchNormalization` | no | None | We suspect that layer normalization already limits internal covariate shift. |
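As an illustration of the table above, the sketch below translates a small Keras stack layer by layer (the architecture itself is arbitrary and only serves as an example):

```python
from tensorflow.keras.layers import Flatten, Input

from deel.lip.layers import ScaledAveragePooling2D, SpectralConv2D, SpectralDense
from deel.lip.model import Sequential

# Conv2D           -> SpectralConv2D
# AveragePooling2D -> ScaledAveragePooling2D
# Flatten          -> kept as-is (already 1-Lipschitz)
# Dense            -> SpectralDense
# Dropout / BatchNormalization -> dropped (no 1-Lipschitz equivalent)
lip_stack = Sequential(
    [
        Input(shape=(28, 28, 1)),
        SpectralConv2D(filters=8, kernel_size=(3, 3)),
        ScaledAveragePooling2D(pool_size=(2, 2)),
        Flatten(),
        SpectralDense(10),
    ]
)
```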
## Design tips

Designing Lipschitz networks requires care in order to avoid vanishing/exploding gradient problems.
Choosing pooling layers:

| layer | advantages | disadvantages |
| --- | --- | --- |
| `ScaledAveragePooling2D` and `MaxPooling2D` | very similar to the original implementations (average pooling just gains a scaling factor). | neither norm preserving nor gradient norm preserving. |
| `InvertibleDownSampling` | norm preserving and gradient norm preserving. | increases the number of channels (and the number of parameters of the next layer). |
| `ScaledL2NormPooling2D` (`sqrt(avgpool(x**2))`) | norm preserving. | lower numerical stability of the gradient when inputs are close to zero. |
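The sketch below instantiates the three options side by side. This is a minimal sketch under the assumption that each constructor accepts a `pool_size` argument, as `ScaledL2NormPooling2D` does in the main example further down:

```python
from deel.lip.layers import (
    InvertibleDownSampling,
    ScaledAveragePooling2D,
    ScaledL2NormPooling2D,
)

# Closest to the usual AveragePooling2D (rescaled to stay 1-Lipschitz).
pool_avg = ScaledAveragePooling2D(pool_size=(2, 2))
# Norm preserving, but the gradient is less stable when inputs are close to zero.
pool_l2 = ScaledL2NormPooling2D(pool_size=(2, 2))
# Norm and gradient norm preserving; a 2x2 window multiplies the channel count by 4.
pool_inv = InvertibleDownSampling(pool_size=(2, 2))
```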
Choosing activations:

| layer | advantages | disadvantages |
| --- | --- | --- |
| `ReLU` | | creates a strong vanishing gradient effect. If you manage to learn with it, please call 911. |
| `MaxMin` (`stack([ReLU(x), ReLU(-x)])`) | has similar properties to ReLU, but is norm and gradient norm preserving. | doubles the number of outputs. |
| `GroupSort` | input norm and gradient norm preserving. Also limits the need for biases (as it is shift invariant). | more computationally expensive (when its parameter `n` is large). |
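As a small sketch (the class names come from `deel.lip.activations`, as used in the full example at the end of this page), these activations can be passed directly as the `activation` argument of a Lipschitz layer:

```python
from deel.lip.activations import GroupSort, MaxMin
from deel.lip.layers import SpectralDense

# MaxMin stacks ReLU(x) and ReLU(-x): gradient norm preserving, but doubles the outputs.
maxmin_dense = SpectralDense(32, activation=MaxMin())
# GroupSort sorts values within groups of size n; GroupSort(2) is a common choice.
groupsort_dense = SpectralDense(32, activation=GroupSort(2))
```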
Please note that when learning with the `HKR_loss` and `HKR_multiclass_loss`, no activation is required on the last layer.
```python
from deel.lip.layers import (
    SpectralDense,
    SpectralConv2D,
    ScaledL2NormPooling2D,
    FrobeniusDense,
)
from deel.lip.model import Sequential
from deel.lip.activations import GroupSort
from deel.lip.losses import MulticlassHKR, MulticlassKR
from tensorflow.keras.layers import Input, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
import numpy as np
# Sequential (resp. Model) from deel.lip.model has the same properties as any Lipschitz model.
# It acts only as a container, with features specific to Lipschitz
# functions (condensation, vanilla_exportation, ...), but the layers are fully compatible
# with tf.keras.models.Sequential/Model.
model = Sequential(
    [
        Input(shape=(28, 28, 1)),
        # Lipschitz layers preserve the API of their superclass (here Conv2D).
        # An optional parameter is available: k_coef_lip, which controls the Lipschitz
        # constant of the layer.
        SpectralConv2D(
            filters=16,
            kernel_size=(3, 3),
            activation=GroupSort(2),
            use_bias=True,
            kernel_initializer="orthogonal",
        ),
        # usual pooling layers are implemented (avg, max, ...), but new layers are also available
        ScaledL2NormPooling2D(pool_size=(2, 2), data_format="channels_last"),
        SpectralConv2D(
            filters=16,
            kernel_size=(3, 3),
            activation=GroupSort(2),
            use_bias=True,
            kernel_initializer="orthogonal",
        ),
        ScaledL2NormPooling2D(pool_size=(2, 2), data_format="channels_last"),
        # our layers are fully interoperable with existing keras layers
        Flatten(),
        SpectralDense(
            32,
            activation=GroupSort(2),
            use_bias=True,
            kernel_initializer="orthogonal",
        ),
        FrobeniusDense(
            10, activation=None, use_bias=False, kernel_initializer="orthogonal"
        ),
    ],
    # similarly, the model has a k_coef_lip parameter that automatically sets
    # the Lipschitz constant of each layer
    k_coef_lip=1.0,
    name="hkr_model",
)
# The HKR (hinge-Kantorovich-Rubinstein) loss optimizes robustness along with accuracy.
model.compile(
    # decreasing alpha and increasing min_margin improve robustness (at the cost of accuracy)
    # note also that, in the case of Lipschitz networks, more robustness requires more parameters
    loss=MulticlassHKR(alpha=50, min_margin=0.05),
    optimizer=Adam(1e-3),
    metrics=["accuracy", MulticlassKR()],
)
model.summary()
# load data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# standardize and reshape the data
x_train = np.expand_dims(x_train, -1)
mean = x_train.mean()
std = x_train.std()
x_train = (x_train - mean) / std
x_test = np.expand_dims(x_test, -1)
x_test = (x_test - mean) / std
# one hot encode the labels
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# fit the model
model.fit(
    x_train,
    y_train,
    batch_size=2048,
    epochs=30,
    validation_data=(x_test, y_test),
    shuffle=True,
)
# once training is finished, you can convert SpectralDense layers into Dense layers
# and SpectralConv2D layers into Conv2D layers, which optimizes performance for inference
vanilla_model = model.vanilla_export()
```
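A minimal usage sketch of the exported model follows (the tolerance value and file name below are arbitrary choices, not part of the original demo): the vanilla model is a plain Keras model, so it can be used for inference or saved as usual, and its outputs should match those of the Lipschitz model.

```python
# Sanity check: both models should produce (numerically) the same predictions.
preds_lip = model.predict(x_test[:16])
preds_vanilla = vanilla_model.predict(x_test[:16])
np.testing.assert_allclose(preds_lip, preds_vanilla, atol=1e-4)

# The exported model can be saved like any keras model.
vanilla_model.save("hkr_model_vanilla.h5")
```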