convolutions

AdaptiveOrthoConv2d(in_channels, out_channels, kernel_size, stride=1, padding='same', dilation=1, groups=1, bias=True, padding_mode='circular', ortho_params=OrthoParams())

Factory function to create an orthogonal convolutional layer, selecting the appropriate class based on kernel size and stride. This is the implementation of the Adaptive Orthogonal Convolution scheme [1]. It aims to be scalable to large networks and large image sizes, while enforcing orthogonality in the convolutional layers. This layer is also intended to be compatible with all the features of the nn.Conv2d class (e.g., striding, dilation, grouping, etc.). This method has an explicit kernel, which means that the forward operation is equivalent to a standard convolutional layer, but the weights are constrained to be orthogonal.

Key Features:

- Enforces orthogonality, preserving gradient norms.
- Supports native striding, dilation, grouped convolutions, and flexible padding.

Behavior:

- When kernel_size == stride, the layer is an `RKOConv2d`.
- When stride == 1, the layer is a `FastBlockConv2d`.
- Otherwise, the layer is a `BcopRkoConv2d`.
Note
  • This implementation also works under zero padding; its Lipschitz constant remains tight, but orthogonality is lost on the image border.
  • Unit testing validated a tolerance of 1e-4 under various orthogonalization schemes (see reparametrizers). Only Cholesky-based methods required a looser tolerance of 5e-2.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `in_channels` | `int` | Number of input channels. | *required* |
| `out_channels` | `int` | Number of output channels. | *required* |
| `kernel_size` | `_size_2_t` | Size of the convolution kernel. | *required* |
| `stride` | `_size_2_t` | Stride of the convolution. | `1` |
| `padding` | `str` or `_size_2_t` | Padding mode or size. | `'same'` |
| `dilation` | `_size_2_t` | Dilation rate. | `1` |
| `groups` | `int` | Number of blocked connections from input to output channels. | `1` |
| `bias` | `bool` | Whether to include a learnable bias. | `True` |
| `padding_mode` | `str` | Padding mode. | `'circular'` |
| `ortho_params` | `OrthoParams` | Parameters to control orthogonality. | `OrthoParams()` |

Returns:

| Type | Description |
| --- | --- |
| `Conv2d` | A configured instance of `nn.Conv2d` (one of `RKOConv2d`, `FastBlockConv2d`, or `BcopRkoConv2d`). |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `kernel_size < stride`, as orthogonality cannot be enforced. |

References
  • [1] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025). An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures. https://arxiv.org/abs/2501.07930
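
Usage sketch (not part of the original docs): the snippet below imports the factory from the module listed in the source path underneath; the package may also re-export it at a higher level. With equal channel counts, stride 1 and circular padding the resulting operator is square-orthogonal, so it should preserve the norm of its input up to the tolerances given in the note above.

```python
import torch

# Import path taken from the source location shown below (assumption: no shorter re-export is used).
from orthogonium.layers.conv.AOC.ortho_conv import AdaptiveOrthoConv2d

conv = AdaptiveOrthoConv2d(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    stride=1,
    padding="same",
    padding_mode="circular",
    bias=False,
)

x = torch.randn(4, 64, 32, 32)
y = conv(x)
# Square orthogonal operator: input and output norms should match up to ~1e-4.
print(torch.linalg.vector_norm(x), torch.linalg.vector_norm(y))
```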
Source code in orthogonium\layers\conv\AOC\ortho_conv.py
def AdaptiveOrthoConv2d(
    in_channels: int,
    out_channels: int,
    kernel_size: _size_2_t,
    stride: _size_2_t = 1,
    padding: Union[str, _size_2_t] = "same",
    dilation: _size_2_t = 1,
    groups: int = 1,
    bias: bool = True,
    padding_mode: str = "circular",
    ortho_params: OrthoParams = OrthoParams(),
) -> nn.Conv2d:
    """
    Factory function to create an orthogonal convolutional layer, selecting the appropriate class based on kernel
    size and stride. This is the implementation for the `Adaptive Orthogonal Convolution` scheme [1]. It aims to be
    scalable to large networks and large image sizes, while enforcing orthogonality in the convolutional layers.
    This layer is also intended to be compatible with all the features of the `nn.Conv2d` class (e.g., striding, dilation,
    grouping, etc.). This method has an explicit kernel, which means that the forward operation is equivalent to a
    standard convolutional layer, but the weights are constrained to be orthogonal.

    Key Features:
    -------------
        - Enforces orthogonality, preserving gradient norms.
        - Supports native striding, dilation, grouped convolutions, and flexible padding.

    Behavior:
    ---------
        - When kernel_size == stride, the layer is an `RKOConv2d`.
        - When stride == 1, the layer is a `FastBlockConv2d`.
        - Otherwise, the layer is a `BcopRkoConv2d`.

    Note:
        - This implementation also works under zero padding; its Lipschitz constant remains tight, but orthogonality
            is lost on the image border.
        - Unit testing validated a tolerance of 1e-4 under various orthogonalization schemes (see
            reparametrizers). Only Cholesky-based methods required a looser tolerance of 5e-2.

    Arguments:
        in_channels (int): Number of input channels.
        out_channels (int): Number of output channels.
        kernel_size (_size_2_t): Size of the convolution kernel.
        stride (_size_2_t, optional): Stride of the convolution. Default is 1.
        padding (str or _size_2_t, optional): Padding mode or size. Default is "same".
        dilation (_size_2_t, optional): Dilation rate. Default is 1.
        groups (int, optional): Number of blocked connections from input to output channels. Default is 1.
        bias (bool, optional): Whether to include a learnable bias. Default is True.
        padding_mode (str, optional): Padding mode. Default is "circular".
        ortho_params (OrthoParams, optional): Parameters to control orthogonality. Default is `OrthoParams()`.

    Returns:
        A configured instance of `nn.Conv2d` (one of `RKOConv2d`, `FastBlockConv2d`, or `BcopRkoConv2d`).

    Raises:
        `ValueError`: If kernel_size < stride, as orthogonality cannot be enforced.


    References:
        - [1] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025).
        An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures.
        <https://arxiv.org/abs/2501.07930>
    """

    if kernel_size < stride:
        raise ValueError(
            "kernel size must be at least as large as stride. The set of orthogonal convolutions is empty in this setting."
        )
    if kernel_size == stride:
        convclass = RKOConv2d
    elif (stride == 1) or ((in_channels >= out_channels) and (dilation > 1)):
        convclass = FastBlockConv2d
    else:
        convclass = BcopRkoConv2d
    return convclass(
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        dilation=dilation,
        groups=groups,
        bias=bias,
        padding_mode=padding_mode,
        ortho_params=ortho_params,
    )

AdaptiveOrthoConvTranspose2d(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True, dilation=1, padding_mode='zeros', ortho_params=OrthoParams())

Factory function to create an orthogonal transposed convolutional layer, selecting the appropriate class based on kernel size and stride. This is the implementation of the Adaptive Orthogonal Convolution scheme [1]. It aims to be scalable to large networks and large image sizes, while enforcing orthogonality in the convolutional layers. This layer is also intended to be compatible with all the features of the nn.ConvTranspose2d class (e.g., striding, dilation, grouping, etc.). This method has an explicit kernel, which means that the forward operation is equivalent to a standard transposed convolutional layer, but the weights are constrained to be orthogonal.

Key Features:

- Ensures orthogonality in transpose convolutions for stable gradient propagation.
- Supports dilation, grouped operations, and efficient kernel construction.

Behavior:

- When kernel_size == stride, the layer is an `RkoConvTranspose2d`.
- When stride == 1, the layer is a `FastBlockConvTranspose2D`.
- Otherwise, the layer is a `BcopRkoConvTranspose2d`.
Note
  • This implementation also works under zero padding; its Lipschitz constant remains tight, but orthogonality is lost on the image border.
  • The current implementation of torch.nn.ConvTranspose2d does not support circular padding. One can implement circular padding manually by adding a padding layer before this one and setting padding = (0, 0).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `in_channels` | `int` | Number of input channels. | *required* |
| `out_channels` | `int` | Number of output channels. | *required* |
| `kernel_size` | `_size_2_t` | Size of the convolution kernel. | *required* |
| `stride` | `_size_2_t` | Stride of the transposed convolution. | `1` |
| `padding` | `_size_2_t` | Padding size. | `0` |
| `output_padding` | `_size_2_t` | Additional size added to the output shape. | `0` |
| `groups` | `int` | Number of groups. | `1` |
| `bias` | `bool` | Whether to include a learnable bias. | `True` |
| `dilation` | `_size_2_t` | Dilation rate. | `1` |
| `padding_mode` | `str` | Padding mode. | `'zeros'` |
| `ortho_params` | `OrthoParams` | Parameters to control orthogonality. | `OrthoParams()` |

Returns:

| Type | Description |
| --- | --- |
| `ConvTranspose2d` | A configured instance of `nn.ConvTranspose2d` (one of `RkoConvTranspose2d`, `FastBlockConvTranspose2D`, or `BcopRkoConvTranspose2d`). |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `kernel_size < stride`, as orthogonality cannot be enforced. |

References
  • [1] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025). An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures. https://arxiv.org/abs/2501.07930
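
Usage sketch (not part of the original docs; the import path follows the source location below): a strided orthogonal upsampling layer. With `kernel_size == stride` the factory returns the `RkoConvTranspose2d` variant, and the spatial resolution is doubled.

```python
import torch

from orthogonium.layers.conv.AOC.ortho_conv import AdaptiveOrthoConvTranspose2d

# kernel_size == stride, so the RKO-based transposed variant is selected (see Behavior above).
up = AdaptiveOrthoConvTranspose2d(
    in_channels=64,
    out_channels=32,
    kernel_size=2,
    stride=2,
    bias=False,
)

x = torch.randn(4, 64, 16, 16)
y = up(x)
print(y.shape)  # torch.Size([4, 32, 32, 32])
```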
Source code in orthogonium\layers\conv\AOC\ortho_conv.py
def AdaptiveOrthoConvTranspose2d(
    in_channels: int,
    out_channels: int,
    kernel_size: _size_2_t,
    stride: _size_2_t = 1,
    padding: _size_2_t = 0,
    output_padding: _size_2_t = 0,
    groups: int = 1,
    bias: bool = True,
    dilation: _size_2_t = 1,
    padding_mode: str = "zeros",
    ortho_params: OrthoParams = OrthoParams(),
) -> nn.ConvTranspose2d:
    """
    Factory function to create an orthogonal transposed convolutional layer, selecting the appropriate class based on kernel
    size and stride. This is the implementation for the `Adaptive Orthogonal Convolution` scheme [1]. It aims to be
    scalable to large networks and large image sizes, while enforcing orthogonality in the convolutional layers.
    This layer is also intended to be compatible with all the features of the `nn.ConvTranspose2d` class (e.g., striding, dilation,
    grouping, etc.). This method has an explicit kernel, which means that the forward operation is equivalent to a
    standard transposed convolutional layer, but the weights are constrained to be orthogonal.

    Key Features:
    -------------
        - Ensures orthogonality in transpose convolutions for stable gradient propagation.
        - Supports dilation, grouped operations, and efficient kernel construction.

    Behavior:
    ---------
        - When kernel_size == stride, the layer is an `RkoConvTranspose2d`.
        - When stride == 1, the layer is a `FastBlockConvTranspose2D`.
        - Otherwise, the layer is a `BcopRkoConvTranspose2d`.


    Note:
        - This implementation also works under zero padding; its Lipschitz constant remains tight, but orthogonality
            is lost on the image border.
        - The current implementation of torch.nn.ConvTranspose2d does not support circular padding. One can
            implement circular padding manually by adding a padding layer before this one and setting padding = (0, 0).

    Arguments:
        in_channels (int): Number of input channels.
        out_channels (int): Number of output channels.
        kernel_size (_size_2_t): Size of the convolution kernel.
        stride (_size_2_t, optional): Stride of the transpose convolution. Default is 1.
        padding (_size_2_t, optional): Padding size. Default is 0.
        output_padding (_size_2_t, optional): Additional size for output. Default is 0.
        groups (int, optional): Number of groups. Default is 1.
        bias (bool, optional): Whether to include a learnable bias. Default is True.
        dilation (_size_2_t, optional): Dilation rate. Default is 1.
        padding_mode (str, optional): Padding mode. Default is "zeros".
        ortho_params (OrthoParams, optional): Parameters to control orthogonality. Default is `OrthoParams()`.

    Returns:
        A configured instance of `nn.ConvTranspose2d` (one of `RkoConvTranspose2d`, `FastBlockConvTranspose2D`, or `BcopRkoConvTranspose2d`).

    **Raises:**
    - `ValueError`: If kernel_size < stride, as orthogonality cannot be enforced.


    References:
        - [1] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025).
        An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures.
        <https://arxiv.org/abs/2501.07930>
    """

    if kernel_size < stride:
        raise ValueError(
            "kernel size must be at least as large as stride. The set of orthogonal convolutions is empty in this setting."
        )
    if kernel_size == stride:
        convclass = RkoConvTranspose2d
    elif stride == 1:
        convclass = FastBlockConvTranspose2D
    else:
        convclass = BcopRkoConvTranspose2d
    return convclass(
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        output_padding=output_padding,
        groups=groups,
        bias=bias,
        dilation=dilation,
        padding_mode=padding_mode,
        ortho_params=ortho_params,
    )

SSL derived 1-Lipschitz Layers

This module implements several 1-Lipschitz residual blocks, inspired by and extending the SDP-based Lipschitz Layers (SLL) from [1]. Specifically:

  • SDPBasedLipschitzResBlock
    The original version of the 1-Lipschitz convolutional residual block. It enforces Lipschitz constraints by rescaling activation outputs according to an estimate of the operator norm.

  • SLLxAOCLipschitzResBlock
    An extended version of the SLL approach described in [1], combined with additional orthogonal convolutions to handle stride, kernel-size, or channel-dimension changes. It fuses multiple convolutions via the block convolution, thereby preserving the 1-Lipschitz property while enabling strided downsampling or modifying input/output channels.

  • AOCLipschitzResBlock
    A variant of the original Lipschitz block where the core convolution is replaced by an AdaptiveOrthoConv2d. It maintains the 1-Lipschitz property with orthogonal weight parameterization while providing efficient convolution implementations.

References

[1] Alexandre Araujo, Aaron J. Havens, Blaise Delattre, Alexandre Allauzen, and Bin Hu. A Unified Algebraic Perspective on Lipschitz Neural Networks. In The Eleventh International Conference on Learning Representations, 2023. https://arxiv.org/abs/2303.03169

[2] Thibaut Boissin, Franck Mamalet, Thomas Fel, Agustin Martin Picard, Thomas Massena, and Mathieu Serrurier. An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures, 2025. https://arxiv.org/abs/2501.07930

Notes on the SLL approach

In [1], the SLL layer for convolutions is a 1-Lipschitz residual operation defined approximately as:

\[ y = x - \mathbf{K}^T \star (t \times \sigma(\mathbf{K} \star x + b)), \]

where \(\mathbf{K}\) is a Toeplitz (convolution) matrix representing a 1-Lipschitz operator. In practice, this is achieved by computing a normalization vector \(\mathbf{t}\) and rescaling the activation outputs by \(\mathbf{t}\).

By default, the SLL formulation does not allow strides or changes in the number of channels.
To address these issues, SLLxAOCLipschitzResBlock adds extra orthogonal convolutions before and/or after the main SLL operation. These additional convolutions can be merged via block convolution (Proposition 1 in [2]) to maintain 1-Lipschitz behavior while enabling stride and/or channel changes.

When \(\mathbf{K}\), \(\mathbf{K}_{pre}\), and \(\mathbf{K}_{post}\) each correspond to 2×2 convolutions, the resulting block effectively contains two 3×3 convolutions in one branch and a single 4×4 stride-2 convolution in the skip branch—quite similar to typical ResNet blocks.
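
As a quick sanity check on these sizes (a sketch assuming every convolution involved has stride 1), fusing two kernels with the block convolution follows the usual support-size rule for composed convolutions:

\[ k_{\text{fused}} = k_1 + k_2 - 1, \]

so fusing two 2×2 kernels indeed yields the 3×3 convolutions mentioned above.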

AOCLipschitzResBlock

Bases: Module

Source code in orthogonium\layers\conv\SLL\sll_layer.py
class AOCLipschitzResBlock(nn.Module):
    def __init__(
        self,
        in_channels: int,
        inner_dim_factor: int,
        kernel_size: _size_2_t,
        dilation: _size_2_t = 1,
        groups: int = 1,
        bias: bool = True,
        padding_mode: str = "circular",
        ortho_params: OrthoParams = OrthoParams(),
    ):
        """
        A Lipschitz residual block in which the main convolution is replaced by
        `AdaptiveOrthoConv2d` (AOC). This preserves 1-Lipschitz (or lower) behavior through
        an orthogonal parameterization, without explicitly computing a scaling factor `t`.

        $$
        y = x - \mathbf{K}^T \\star (\sigma(\\mathbf{K} \\star x + b)),
        $$

        **Args**:
          - `in_channels` (int): Number of input channels.
          - `inner_dim_factor` (int): Multiplier for internal representation size.
          - `kernel_size` (_size_2_t): Convolution kernel size.
          - `dilation` (_size_2_t, optional): Default is 1.
          - `groups` (int, optional): Default is 1.
          - `bias` (bool, optional): If True, adds a learnable bias. Default is True.
          - `padding_mode` (str, optional): `'circular'` or `'zeros'`. Default is `'circular'`.
          - `ortho_params` (OrthoParams, optional): Orthogonal parameterization settings. Default is `OrthoParams()`.


        References:
            - [1] Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., & Hu, B.
            A Unified Algebraic Perspective on Lipschitz Neural Networks.
            In The Eleventh International Conference on Learning Representations.
            <https://arxiv.org/abs/2303.03169>
            - [2] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025).
            An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures.
            <https://arxiv.org/abs/2501.07930>
        """
        super().__init__()

        inner_dim = int(in_channels * inner_dim_factor)
        self.activation = nn.ReLU()

        if padding_mode not in ["circular", "zeros"]:
            raise ValueError("padding_mode must be either 'circular' or 'zeros'")
        if padding_mode == "circular":
            self.padding = 0  # will be handled by the padding function
        else:
            self.padding = kernel_size // 2

        self.in_conv = AdaptiveOrthoConv2d(
            in_channels,
            inner_dim,
            kernel_size=kernel_size,
            stride=1,
            padding="same",
            dilation=dilation,
            groups=groups,
            bias=bias,
            padding_mode=padding_mode,
            ortho_params=ortho_params,
        )
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.groups = groups
        self.bias = bias
        self.padding_mode = padding_mode

    def forward(self, x):
        kernel = self.in_conv.weight
        # conv
        res = x
        if self.padding_mode == "circular":
            res = F.pad(
                res,
                (self.padding,) * 4,
                mode="circular",
                value=0,
            )
        res = F.conv2d(
            res,
            kernel,
            bias=self.in_conv.bias,
            padding=0,
            groups=self.groups,
        )
        # activation
        res = self.activation(res)
        # conv transpose
        if self.padding_mode == "circular":
            res = F.pad(
                res,
                (self.padding,) * 4,
                mode="circular",
                value=0,
            )
        res = 2 * F.conv_transpose2d(res, kernel, padding=0, groups=self.groups)
        # residual
        out = x - res
        return out

__init__(in_channels, inner_dim_factor, kernel_size, dilation=1, groups=1, bias=True, padding_mode='circular', ortho_params=OrthoParams())

A Lipschitz residual block in which the main convolution is replaced by AdaptiveOrthoConv2d (AOC). This preserves 1-Lipschitz (or lower) behavior through an orthogonal parameterization, without explicitly computing a scaling factor t.

\[ y = x - \mathbf{K}^T \star (\sigma(\mathbf{K} \star x + b)), \]

Args:

- in_channels (int): Number of input channels.
- inner_dim_factor (int): Multiplier for internal representation size.
- kernel_size (_size_2_t): Convolution kernel size.
- dilation (_size_2_t, optional): Default is 1.
- groups (int, optional): Default is 1.
- bias (bool, optional): If True, adds a learnable bias. Default is True.
- padding_mode (str, optional): 'circular' or 'zeros'. Default is 'circular'.
- ortho_params (OrthoParams, optional): Orthogonal parameterization settings. Default is OrthoParams().

References
  • [1] Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., & Hu, B. A Unified Algebraic Perspective on Lipschitz Neural Networks. In The Eleventh International Conference on Learning Representations. https://arxiv.org/abs/2303.03169
  • [2] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025). An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures. https://arxiv.org/abs/2501.07930
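
Usage sketch (not part of the original docs; the import path follows the source location below): the block keeps channel count and spatial size, so it can be dropped into a residual backbone wherever no downsampling is needed.

```python
import torch

from orthogonium.layers.conv.SLL.sll_layer import AOCLipschitzResBlock

block = AOCLipschitzResBlock(
    in_channels=64,
    inner_dim_factor=2,
    kernel_size=3,
    padding_mode="circular",
)

x = torch.randn(4, 64, 32, 32)
y = block(x)
print(y.shape)  # torch.Size([4, 64, 32, 32]); the mapping is 1-Lipschitz
```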
Source code in orthogonium\layers\conv\SLL\sll_layer.py
def __init__(
    self,
    in_channels: int,
    inner_dim_factor: int,
    kernel_size: _size_2_t,
    dilation: _size_2_t = 1,
    groups: int = 1,
    bias: bool = True,
    padding_mode: str = "circular",
    ortho_params: OrthoParams = OrthoParams(),
):
    """
    A Lipschitz residual block in which the main convolution is replaced by
    `AdaptiveOrthoConv2d` (AOC). This preserves 1-Lipschitz (or lower) behavior through
    an orthogonal parameterization, without explicitly computing a scaling factor `t`.

    $$
    y = x - \mathbf{K}^T \\star (\sigma(\\mathbf{K} \\star x + b)),
    $$

    **Args**:
      - `in_channels` (int): Number of input channels.
      - `inner_dim_factor` (int): Multiplier for internal representation size.
      - `kernel_size` (_size_2_t): Convolution kernel size.
      - `dilation` (_size_2_t, optional): Default is 1.
      - `groups` (int, optional): Default is 1.
      - `bias` (bool, optional): If True, adds a learnable bias. Default is True.
      - `padding_mode` (str, optional): `'circular'` or `'zeros'`. Default is `'circular'`.
      - `ortho_params` (OrthoParams, optional): Orthogonal parameterization settings. Default is `OrthoParams()`.


    References:
        - [1] Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., & Hu, B.
        A Unified Algebraic Perspective on Lipschitz Neural Networks.
        In The Eleventh International Conference on Learning Representations.
        <https://arxiv.org/abs/2303.03169>
        - [2] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025).
        An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures.
        <https://arxiv.org/abs/2501.07930>
    """
    super().__init__()

    inner_dim = int(in_channels * inner_dim_factor)
    self.activation = nn.ReLU()

    if padding_mode not in ["circular", "zeros"]:
        raise ValueError("padding_mode must be either 'circular' or 'zeros'")
    if padding_mode == "circular":
        self.padding = 0  # will be handled by the padding function
    else:
        self.padding = kernel_size // 2

    self.in_conv = AdaptiveOrthoConv2d(
        in_channels,
        inner_dim,
        kernel_size=kernel_size,
        stride=1,
        padding="same",
        dilation=dilation,
        groups=groups,
        bias=bias,
        padding_mode=padding_mode,
        ortho_params=ortho_params,
    )
    self.kernel_size = kernel_size
    self.dilation = dilation
    self.groups = groups
    self.bias = bias
    self.padding_mode = padding_mode

SDPBasedLipschitzDense

Bases: Module

Source code in orthogonium\layers\conv\SLL\sll_layer.py
class SDPBasedLipschitzDense(nn.Module):
    def __init__(self, in_features, out_features, inner_dim, **kwargs):
        """
        A 1-Lipschitz fully-connected layer (dense version). Similar to the convolutional
        SLL approach, but operates on vectors:

        $$
        y = x - K^T \\times (t \\times \sigma(K \\times x + b)),
        $$

        **Args**:
          - `in_features` (int): Input size.
          - `out_features` (int): Output size (must match `in_features` to remain 1-Lipschitz).
          - `inner_dim` (int): The internal dimension used for the transform.


        References:
            - Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., & Hu, B.
            A Unified Algebraic Perspective on Lipschitz Neural Networks.
            In The Eleventh International Conference on Learning Representations.
            <https://arxiv.org/abs/2303.03169>
        """
        super().__init__()

        inner_dim = inner_dim if inner_dim != -1 else in_features
        self.activation = nn.ReLU()

        self.weight = nn.Parameter(torch.empty(inner_dim, in_features))
        self.bias = nn.Parameter(torch.empty(1, inner_dim))
        self.q = nn.Parameter(torch.randn(inner_dim))

        nn.init.xavier_normal_(self.weight)
        fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / np.sqrt(fan_in)
        nn.init.uniform_(self.bias, -bound, bound)  # bias init

    def compute_t(self):
        q = torch.exp(self.q)
        q_inv = torch.exp(-self.q)
        t = torch.abs(
            torch.einsum("i,ik,kj,j -> ij", q_inv, self.weight, self.weight.T, q)
        ).sum(1)
        t = safe_inv(t)
        return t

    def forward(self, x):
        t = self.compute_t()
        res = F.linear(x, self.weight)
        res = res + self.bias
        res = t * self.activation(res)
        res = 2 * F.linear(res, self.weight.T)
        out = x - res
        return out

__init__(in_features, out_features, inner_dim, **kwargs)

A 1-Lipschitz fully-connected layer (dense version). Similar to the convolutional SLL approach, but operates on vectors:

\[ y = x - K^T \times (t \times \sigma(K \times x + b)), \]

Args:

- in_features (int): Input size.
- out_features (int): Output size (must match in_features to remain 1-Lipschitz).
- inner_dim (int): The internal dimension used for the transform.

References
  • Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., & Hu, B. A Unified Algebraic Perspective on Lipschitz Neural Networks. In The Eleventh International Conference on Learning Representations. https://arxiv.org/abs/2303.03169
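
Usage sketch (not part of the original docs; the import path follows the source location below): a 1-Lipschitz dense residual layer on 128-dimensional features.

```python
import torch

from orthogonium.layers.conv.SLL.sll_layer import SDPBasedLipschitzDense

dense = SDPBasedLipschitzDense(in_features=128, out_features=128, inner_dim=256)

x = torch.randn(8, 128)
y = dense(x)
print(y.shape)  # torch.Size([8, 128]); out_features must equal in_features
```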
Source code in orthogonium\layers\conv\SLL\sll_layer.py
def __init__(self, in_features, out_features, inner_dim, **kwargs):
    """
    A 1-Lipschitz fully-connected layer (dense version). Similar to the convolutional
    SLL approach, but operates on vectors:

    $$
    y = x - K^T \\times (t \\times \sigma(K \\times x + b)),
    $$

    **Args**:
      - `in_features` (int): Input size.
      - `out_features` (int): Output size (must match `in_features` to remain 1-Lipschitz).
      - `inner_dim` (int): The internal dimension used for the transform.


    References:
        - Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., & Hu, B.
        A Unified Algebraic Perspective on Lipschitz Neural Networks.
        In The Eleventh International Conference on Learning Representations.
        <https://arxiv.org/abs/2303.03169>
    """
    super().__init__()

    inner_dim = inner_dim if inner_dim != -1 else in_features
    self.activation = nn.ReLU()

    self.weight = nn.Parameter(torch.empty(inner_dim, in_features))
    self.bias = nn.Parameter(torch.empty(1, inner_dim))
    self.q = nn.Parameter(torch.randn(inner_dim))

    nn.init.xavier_normal_(self.weight)
    fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
    bound = 1 / np.sqrt(fan_in)
    nn.init.uniform_(self.bias, -bound, bound)  # bias init

SDPBasedLipschitzResBlock

Bases: Module

Source code in orthogonium\layers\conv\SLL\sll_layer.py
class SDPBasedLipschitzResBlock(nn.Module):
    def __init__(self, cin, inner_dim_factor, kernel_size=3, groups=1, **kwargs):
        """
         Original 1-Lipschitz convolutional residual block, based on the SDP-based Lipschitz
        layer (SLL) approach [1]. It has a structure akin to:

        out = x - 2 * ConvTranspose( t * ReLU(Conv(x) + bias) )

        where `t` is a channel-wise scaling factor ensuring a Lipschitz constant ≤ 1.

        !!! note
            By default, `SDPBasedLipschitzResBlock` assumes `cin == cout` and does **not** handle
            stride changes outside the skip connection (i.e., typically used when stride=1 or 2
            for downsampling in a standard residual architecture).

        **Args**:
          - `cin` (int): Number of input channels.
          - `cout` (int): Number of output channels.
          - `inner_dim_factor` (float): Multiplier for the intermediate dimensionality.
          - `kernel_size` (int, optional): Size of the convolution kernel. Default is 3.
          - `groups` (int, optional): Number of groups for the convolution. Default is 1.
          - `**kwargs`: Additional keyword arguments (unused).


        References:
            - Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., & Hu, B.
            A Unified Algebraic Perspective on Lipschitz Neural Networks.
            In The Eleventh International Conference on Learning Representations.
            <https://arxiv.org/abs/2303.03169>
        """
        super().__init__()

        inner_dim = int(cin * inner_dim_factor)
        self.activation = nn.ReLU()
        self.groups = groups

        self.padding = kernel_size // 2

        self.kernel = nn.Parameter(
            torch.randn(inner_dim, cin // groups, kernel_size, kernel_size)
        )
        parametrize.register_parametrization(
            self,
            "kernel",
            AOLReparametrizer(
                inner_dim,
                groups=groups,
            ),
        )
        self.bias = nn.Parameter(torch.empty(1, inner_dim, 1, 1))
        self.q = nn.Parameter(torch.ones(inner_dim, 1, 1, 1))

        nn.init.xavier_normal_(self.kernel)
        fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.kernel)
        bound = 1 / np.sqrt(fan_in)
        nn.init.uniform_(self.bias, -bound, bound)  # bias init

    def forward(self, x):
        res = F.conv2d(x, self.kernel, padding=self.padding, groups=self.groups)
        res = res + self.bias
        res = self.activation(res)
        with parametrize.cached():
            res = 2 * F.conv_transpose2d(
                res, self.kernel, padding=self.padding, groups=self.groups
            )
        out = x - res
        return out

__init__(cin, inner_dim_factor, kernel_size=3, groups=1, **kwargs)

Original 1-Lipschitz convolutional residual block, based on the SDP-based Lipschitz layer (SLL) approach [1]. It has a structure akin to:

out = x - 2 * ConvTranspose( t * ReLU(Conv(x) + bias) )

where t is a channel-wise scaling factor ensuring a Lipschitz constant ≤ 1.

Note

By default, SDPBasedLipschitzResBlock assumes cin == cout and does not handle stride changes outside the skip connection (i.e., typically used when stride=1 or 2 for downsampling in a standard residual architecture).

Args:

- cin (int): Number of input channels.
- inner_dim_factor (float): Multiplier for the intermediate dimensionality.
- kernel_size (int, optional): Size of the convolution kernel. Default is 3.
- groups (int, optional): Number of groups for the convolution. Default is 1.
- **kwargs: Additional keyword arguments (unused).

References
  • Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., & Hu, B. A Unified Algebraic Perspective on Lipschitz Neural Networks. In The Eleventh International Conference on Learning Representations. https://arxiv.org/abs/2303.03169
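
Usage sketch (not part of the original docs; the import path follows the source location below): with input and output channel counts equal, the block is a drop-in 1-Lipschitz residual stage.

```python
import torch

from orthogonium.layers.conv.SLL.sll_layer import SDPBasedLipschitzResBlock

block = SDPBasedLipschitzResBlock(cin=64, inner_dim_factor=2, kernel_size=3)

x = torch.randn(4, 64, 32, 32)
y = block(x)
print(y.shape)  # torch.Size([4, 64, 32, 32])
```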
Source code in orthogonium\layers\conv\SLL\sll_layer.py
def __init__(self, cin, inner_dim_factor, kernel_size=3, groups=1, **kwargs):
    """
     Original 1-Lipschitz convolutional residual block, based on the SDP-based Lipschitz
    layer (SLL) approach [1]. It has a structure akin to:

    out = x - 2 * ConvTranspose( t * ReLU(Conv(x) + bias) )

    where `t` is a channel-wise scaling factor ensuring a Lipschitz constant ≤ 1.

    !!! note
        By default, `SDPBasedLipschitzResBlock` assumes `cin == cout` and does **not** handle
        stride changes outside the skip connection (i.e., typically used when stride=1 or 2
        for downsampling in a standard residual architecture).

    **Args**:
      - `cin` (int): Number of input channels.
      - `cout` (int): Number of output channels.
      - `inner_dim_factor` (float): Multiplier for the intermediate dimensionality.
      - `kernel_size` (int, optional): Size of the convolution kernel. Default is 3.
      - `groups` (int, optional): Number of groups for the convolution. Default is 1.
      - `**kwargs`: Additional keyword arguments (unused).


    References:
        - Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., & Hu, B.
        A Unified Algebraic Perspective on Lipschitz Neural Networks.
        In The Eleventh International Conference on Learning Representations.
        <https://arxiv.org/abs/2303.03169>
    """
    super().__init__()

    inner_dim = int(cin * inner_dim_factor)
    self.activation = nn.ReLU()
    self.groups = groups

    self.padding = kernel_size // 2

    self.kernel = nn.Parameter(
        torch.randn(inner_dim, cin // groups, kernel_size, kernel_size)
    )
    parametrize.register_parametrization(
        self,
        "kernel",
        AOLReparametrizer(
            inner_dim,
            groups=groups,
        ),
    )
    self.bias = nn.Parameter(torch.empty(1, inner_dim, 1, 1))
    self.q = nn.Parameter(torch.ones(inner_dim, 1, 1, 1))

    nn.init.xavier_normal_(self.kernel)
    fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.kernel)
    bound = 1 / np.sqrt(fan_in)
    nn.init.uniform_(self.bias, -bound, bound)  # bias init

SLLxAOCLipschitzResBlock

Bases: Module

Source code in orthogonium\layers\conv\SLL\sll_layer.py
class SLLxAOCLipschitzResBlock(nn.Module):
    def __init__(
        self, cin, cout, inner_dim_factor, kernel_size=3, stride=2, groups=1, **kwargs
    ):
        """
        Extended SLL-based convolutional residual block. Supports arbitrary kernel sizes,
        strides, and changes in the number of channels by integrating additional
        orthogonal convolutions *and* fusing them via `\mathbconv` [1].

        The forward pass follows:

        $$
        y = (\mathbf{K}_{post} \circledast \mathbf{K}_{pre}) \\star x - (\mathbf{K}_{post} \circledast \mathbf{K}^T) \\star (t \\times  \sigma(( \mathbf{K} \circledast \mathbf{K}_{pre}) \\star x + b)),
        $$

        where $\mathbf{K}_{pre}$ and $\mathbf{K}_{post}$ are obtained with AOC.


        <img src="../../assets/SLL_3.png" alt="illustration of SLL x AOC" width="600">



        where the kernel `\kernel{K}` may effectively be expanded by pre/post AOC layers to
        handle stride and channel changes. This approach is described in "Improving
        SDP-based Lipschitz Layers" of [1].

        **Args**:
          - `cin` (int): Number of input channels.
          - `inner_dim_factor` (float): Multiplier for the internal channel dimension.
          - `kernel_size` (int, optional): Base kernel size for the SLL portion. Default is 3.
          - `stride` (int, optional): Stride for the skip connection. Default is 2.
          - `groups` (int, optional): Number of groups for the convolution. Default is 1.
          - `**kwargs`: Additional options (unused).



        References:
            - Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025).
            An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures.
            <https://arxiv.org/abs/2501.07930>
        """
        super().__init__()
        inner_kernel_size = kernel_size - (stride - 1)
        self.skip_kernel_size = stride + (stride // 2)
        inner_dim = int(cout * inner_dim_factor)
        self.activation = nn.ReLU()
        self.stride = stride
        self.groups = groups
        self.padding = kernel_size // 2
        self.kernel = nn.Parameter(
            torch.randn(
                inner_dim, cin // self.groups, inner_kernel_size, inner_kernel_size
            )
        )
        parametrize.register_parametrization(
            self,
            "kernel",
            AOLReparametrizer(
                inner_dim,
                groups=groups,
            ),
        )
        self.bias = nn.Parameter(torch.empty(1, inner_dim, 1, 1))
        self.q = nn.Parameter(torch.ones(inner_dim, 1, 1, 1))

        nn.init.xavier_normal_(self.kernel)
        fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.kernel)
        bound = 1 / np.sqrt(fan_in)
        nn.init.uniform_(self.bias, -bound, bound)  # bias init

        self.pre_conv = AdaptiveOrthoConv2d(
            cin, cin, kernel_size=stride, stride=1, bias=False, padding=0, groups=groups
        )
        self.post_conv = AdaptiveOrthoConv2d(
            cin,
            cout,
            kernel_size=stride,
            stride=stride,
            bias=False,
            padding=0,
            groups=groups,
        )

    def forward(self, x):
        # compute t
        # print(self.pre_conv.weight.shape, self.kernel.shape, self.post_conv.weight.shape)
        kernel_1a = fast_matrix_conv(
            self.pre_conv.weight, self.kernel, groups=self.groups
        )
        with parametrize.cached():
            kernel_1b = fast_matrix_conv(
                transpose_kernel(self.kernel, groups=self.groups),
                self.post_conv.weight,
                groups=self.groups,
            )
            kernel_2 = fast_matrix_conv(
                self.pre_conv.weight, self.post_conv.weight, groups=self.groups
            )
            # first branch
            # fuse pre conv with kernel
            res = F.conv2d(x, kernel_1a, padding=self.padding, groups=self.groups)
            res = res + self.bias
            res = self.activation(res)
            res = 2 * F.conv2d(
                res,
                kernel_1b,
                padding=self.padding,
                stride=self.stride,
                groups=self.groups,
            )
            # residual branch
            x = F.conv2d(
                x,
                kernel_2,
                padding=self.skip_kernel_size // 2,
                stride=self.stride,
                groups=self.groups,
            )
        # skip connection
        out = x - res
        return out

__init__(cin, cout, inner_dim_factor, kernel_size=3, stride=2, groups=1, **kwargs)

Extended SLL-based convolutional residual block. Supports arbitrary kernel sizes, strides, and changes in the number of channels by integrating additional orthogonal convolutions and fusing them via block convolution [1].

The forward pass follows:

\[ y = (\mathbf{K}_{post} \circledast \mathbf{K}_{pre}) \star x - (\mathbf{K}_{post} \circledast \mathbf{K}^T) \star (t \times \sigma(( \mathbf{K} \circledast \mathbf{K}_{pre}) \star x + b)), \]

where \(\mathbf{K}_{pre}\) and \(\mathbf{K}_{post}\) are obtained with AOC.

[Figure: illustration of SLL x AOC]

The kernel \(\mathbf{K}\) may effectively be expanded by pre/post AOC layers to handle stride and channel changes. This approach is described in the "Improving SDP-based Lipschitz Layers" section of [1].

Args:

- cin (int): Number of input channels.
- cout (int): Number of output channels.
- inner_dim_factor (float): Multiplier for the internal channel dimension.
- kernel_size (int, optional): Base kernel size for the SLL portion. Default is 3.
- stride (int, optional): Stride for the skip connection. Default is 2.
- groups (int, optional): Number of groups for the convolution. Default is 1.
- **kwargs: Additional options (unused).

References
  • Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025). An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures. https://arxiv.org/abs/2501.07930
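
Usage sketch (not part of the original docs; the import path follows the source location below): a stride-2 block that also doubles the channel count, something the plain SLL block cannot do.

```python
import torch

from orthogonium.layers.conv.SLL.sll_layer import SLLxAOCLipschitzResBlock

block = SLLxAOCLipschitzResBlock(
    cin=64,
    cout=128,
    inner_dim_factor=2,
    kernel_size=3,
    stride=2,
)

x = torch.randn(4, 64, 32, 32)
y = block(x)
print(y.shape)  # torch.Size([4, 128, 16, 16]): downsampled and widened
```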
Source code in orthogonium\layers\conv\SLL\sll_layer.py
def __init__(
    self, cin, cout, inner_dim_factor, kernel_size=3, stride=2, groups=1, **kwargs
):
    """
    Extended SLL-based convolutional residual block. Supports arbitrary kernel sizes,
    strides, and changes in the number of channels by integrating additional
    orthogonal convolutions *and* fusing them via `\mathbconv` [1].

    The forward pass follows:

    $$
    y = (\mathbf{K}_{post} \circledast \mathbf{K}_{pre}) \\star x - (\mathbf{K}_{post} \circledast \mathbf{K}^T) \\star (t \\times  \sigma(( \mathbf{K} \circledast \mathbf{K}_{pre}) \\star x + b)),
    $$

    where $\mathbf{K}_{pre}$ and $\mathbf{K}_{post}$ are obtained with AOC.


    <img src="../../assets/SLL_3.png" alt="illustration of SLL x AOC" width="600">



    where the kernel `\kernel{K}` may effectively be expanded by pre/post AOC layers to
    handle stride and channel changes. This approach is described in "Improving
    SDP-based Lipschitz Layers" of [1].

    **Args**:
      - `cin` (int): Number of input channels.
      - `inner_dim_factor` (float): Multiplier for the internal channel dimension.
      - `kernel_size` (int, optional): Base kernel size for the SLL portion. Default is 3.
      - `stride` (int, optional): Stride for the skip connection. Default is 2.
      - `groups` (int, optional): Number of groups for the convolution. Default is 1.
      - `**kwargs`: Additional options (unused).



    References:
        - Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025).
        An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures.
        <https://arxiv.org/abs/2501.07930>
    """
    super().__init__()
    inner_kernel_size = kernel_size - (stride - 1)
    self.skip_kernel_size = stride + (stride // 2)
    inner_dim = int(cout * inner_dim_factor)
    self.activation = nn.ReLU()
    self.stride = stride
    self.groups = groups
    self.padding = kernel_size // 2
    self.kernel = nn.Parameter(
        torch.randn(
            inner_dim, cin // self.groups, inner_kernel_size, inner_kernel_size
        )
    )
    parametrize.register_parametrization(
        self,
        "kernel",
        AOLReparametrizer(
            inner_dim,
            groups=groups,
        ),
    )
    self.bias = nn.Parameter(torch.empty(1, inner_dim, 1, 1))
    self.q = nn.Parameter(torch.ones(inner_dim, 1, 1, 1))

    nn.init.xavier_normal_(self.kernel)
    fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.kernel)
    bound = 1 / np.sqrt(fan_in)
    nn.init.uniform_(self.bias, -bound, bound)  # bias init

    self.pre_conv = AdaptiveOrthoConv2d(
        cin, cin, kernel_size=stride, stride=1, bias=False, padding=0, groups=groups
    )
    self.post_conv = AdaptiveOrthoConv2d(
        cin,
        cout,
        kernel_size=stride,
        stride=stride,
        bias=False,
        padding=0,
        groups=groups,
    )

AOLConv2D

Bases: Conv2d

Source code in orthogonium\layers\conv\AOL\aol.py
class AOLConv2D(nn.Conv2d):

    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride=1,
        padding=0,
        dilation=1,
        groups=1,
        bias=True,
        padding_mode="zeros",
        device=None,
        dtype=None,
        niter=1,
    ):
        """
        Almost-Orthogonal Convolution layer. This layer implements the method proposed in [1] to enforce
        almost-orthogonality. While orthogonality is not enforced, the lipschitz constant of the layer
        is guaranteed to be less than 1.

        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            kernel_size (int or tuple): Size of the convolution kernel.
            stride (int or tuple, optional): Stride of the convolution. Default is 1.
            padding (int or tuple, optional): Padding size. Default is 0.
            dilation (int or tuple, optional): Dilation rate. Default is 1.
            groups (int, optional): Number of groups. Default is 1.
            bias (bool, optional): Whether to include a learnable bias. Default is True.
            padding_mode (str, optional): Padding mode. Default is "zeros".
            device (torch.device, optional): Device to store the layer parameters. Default is None.
            dtype (torch.dtype, optional): Data type to store the layer parameters. Default is None.


        References:
            `[1] Prach, B., & Lampert, C. H. (2022).
                   "Almost-orthogonal layers for efficient general-purpose lipschitz networks."
                   ECCV.`<https://arxiv.org/abs/2208.03160>`_
        """
        super().__init__(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            dilation=dilation,
            groups=groups,
            bias=bias,
            padding_mode=padding_mode,
            device=device,
            dtype=dtype,
        )
        self.niter = niter

        parametrize.register_parametrization(
            self,
            "weight",
            MultiStepAOLReparametrizer(
                min(out_channels, in_channels),
                groups=groups,
                niter=niter,
            ),
        )

    def reset_parameters(self) -> None:
        r"""Resets parameters of the module. This includes the weight and bias
        parameters, if they are used.
        """
        super().reset_parameters()
        # # Reset the parametrization
        # init kernel using the orthogonal kernel
        if not (
            self.in_channels // self.groups == 0
            and self.out_channels // self.groups == 0
        ):
            self.kernel = conv_orthogonal_(
                self.weight,
                stride=self.stride,
                groups=self.groups,
            )

__init__(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None, niter=1)

Almost-Orthogonal Convolution layer. This layer implements the method proposed in [1] to enforce almost-orthogonality. While orthogonality is not strictly enforced, the Lipschitz constant of the layer is guaranteed to be at most 1.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `in_channels` | `int` | Number of input channels. | *required* |
| `out_channels` | `int` | Number of output channels. | *required* |
| `kernel_size` | `int` or `tuple` | Size of the convolution kernel. | *required* |
| `stride` | `int` or `tuple` | Stride of the convolution. | `1` |
| `padding` | `int` or `tuple` | Padding size. | `0` |
| `dilation` | `int` or `tuple` | Dilation rate. | `1` |
| `groups` | `int` | Number of groups. | `1` |
| `bias` | `bool` | Whether to include a learnable bias. | `True` |
| `padding_mode` | `str` | Padding mode. | `'zeros'` |
| `device` | `torch.device` | Device to store the layer parameters. | `None` |
| `dtype` | `torch.dtype` | Data type to store the layer parameters. | `None` |
References

[1] Prach, B., & Lampert, C. H. (2022). "Almost-orthogonal layers for efficient general-purpose Lipschitz networks." ECCV. https://arxiv.org/abs/2208.03160
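
Usage sketch (not part of the original docs; the import path follows the source location below): because the layer is 1-Lipschitz, with `bias=False` the output norm can never exceed the input norm, even though orthogonality (exact norm preservation) is not guaranteed.

```python
import torch

from orthogonium.layers.conv.AOL.aol import AOLConv2D

conv = AOLConv2D(
    in_channels=64,
    out_channels=64,
    kernel_size=3,
    padding=1,
    bias=False,
    niter=2,
)

x = torch.randn(4, 64, 32, 32)
y = conv(x)
# 1-Lipschitz and bias-free, so ||y|| <= ||x|| (not necessarily equal).
print(torch.linalg.vector_norm(y) <= torch.linalg.vector_norm(x) + 1e-5)
```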

Source code in orthogonium\layers\conv\AOL\aol.py
def __init__(
    self,
    in_channels,
    out_channels,
    kernel_size,
    stride=1,
    padding=0,
    dilation=1,
    groups=1,
    bias=True,
    padding_mode="zeros",
    device=None,
    dtype=None,
    niter=1,
):
    """
    Almost-Orthogonal Convolution layer. This layer implements the method proposed in [1] to enforce
    almost-orthogonality. While orthogonality is not enforced, the lipschitz constant of the layer
    is guaranteed to be less than 1.

    Args:
        in_channels (int): Number of input channels.
        out_channels (int): Number of output channels.
        kernel_size (int or tuple): Size of the convolution kernel.
        stride (int or tuple, optional): Stride of the convolution. Default is 1.
        padding (int or tuple, optional): Padding size. Default is 0.
        dilation (int or tuple, optional): Dilation rate. Default is 1.
        groups (int, optional): Number of groups. Default is 1.
        bias (bool, optional): Whether to include a learnable bias. Default is True.
        padding_mode (str, optional): Padding mode. Default is "zeros".
        device (torch.device, optional): Device to store the layer parameters. Default is None.
        dtype (torch.dtype, optional): Data type to store the layer parameters. Default is None.


    References:
        `[1] Prach, B., & Lampert, C. H. (2022).
               "Almost-orthogonal layers for efficient general-purpose lipschitz networks."
               ECCV.`<https://arxiv.org/abs/2208.03160>`_
    """
    super().__init__(
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        dilation=dilation,
        groups=groups,
        bias=bias,
        padding_mode=padding_mode,
        device=device,
        dtype=dtype,
    )
    self.niter = niter

    parametrize.register_parametrization(
        self,
        "weight",
        MultiStepAOLReparametrizer(
            min(out_channels, in_channels),
            groups=groups,
            niter=niter,
        ),
    )

reset_parameters()

Resets parameters of the module. This includes the weight and bias parameters, if they are used.

Source code in orthogonium\layers\conv\AOL\aol.py
def reset_parameters(self) -> None:
    r"""Resets parameters of the module. This includes the weight and bias
    parameters, if they are used.
    """
    super().reset_parameters()
    # # Reset the parametrization
    # init kernel using the orthogonal kernel
    if not (
        self.in_channels // self.groups == 0
        and self.out_channels // self.groups == 0
    ):
        self.kernel = conv_orthogonal_(
            self.weight,
            stride=self.stride,
            groups=self.groups,
        )

AOLConvTranspose2D

Bases: ConvTranspose2d

Source code in orthogonium\layers\conv\AOL\aol.py
class AOLConvTranspose2D(nn.ConvTranspose2d):

    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride=1,
        padding=0,
        output_padding=0,
        groups=1,
        bias=True,
        dilation=1,
        padding_mode="zeros",
        device=None,
        dtype=None,
        niter=1,
    ):
        """
        Almost-Orthogonal Convolution layer. This layer implements the method proposed in [1] to enforce
        almost-orthogonality. While orthogonality is not enforced, the lipschitz constant of the layer
        is guaranteed to be less than 1.

        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            kernel_size (int or tuple): Size of the convolution kernel.
            stride (int or tuple, optional): Stride of the convolution. Default is 1.
            padding (int or tuple, optional): Padding size. Default is 0.
            output_padding (int or tuple, optional): Additional size added to the output shape. Default is 0.
            groups (int, optional): Number of groups. Default is 1.
            bias (bool, optional): Whether to include a learnable bias. Default is True.
            dilation (int or tuple, optional): Dilation rate. Default is 1.
            padding_mode (str, optional): Padding mode. Default is "zeros".
            device (torch.device, optional): Device to store the layer parameters. Default is None.
            dtype (torch.dtype, optional): Data type to store the layer parameters. Default is None.


        References:
            `[1] Prach, B., & Lampert, C. H. (2022).
                   "Almost-orthogonal layers for efficient general-purpose lipschitz networks."
                   ECCV.`<https://arxiv.org/abs/2208.03160>`_
        """
        super().__init__(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            output_padding=output_padding,
            groups=groups,
            bias=bias,
            dilation=dilation,
            padding_mode=padding_mode,
            device=device,
            dtype=dtype,
        )
        self.niter = niter

        # Register the same AOLReparametrizer
        parametrize.register_parametrization(
            self,
            "weight",
            MultiStepAOLReparametrizer(
                min(out_channels, in_channels), groups=groups, niter=niter
            ),
        )

__init__(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True, dilation=1, padding_mode='zeros', device=None, dtype=None, niter=1)

Almost-Orthogonal transposed convolution layer. This layer implements the method proposed in [1] to enforce almost-orthogonality. While orthogonality is not strictly enforced, the Lipschitz constant of the layer is guaranteed to be at most 1.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `in_channels` | `int` | Number of input channels. | required |
| `out_channels` | `int` | Number of output channels. | required |
| `kernel_size` | `int` or `tuple` | Size of the convolution kernel. | required |
| `stride` | `int` or `tuple` | Stride of the convolution. | `1` |
| `padding` | `int` or `tuple` | Padding size. | `0` |
| `output_padding` | `int` or `tuple` | Additional size added to the output shape. | `0` |
| `groups` | `int` | Number of groups. | `1` |
| `bias` | `bool` | Whether to include a learnable bias. | `True` |
| `dilation` | `int` or `tuple` | Dilation rate. | `1` |
| `padding_mode` | `str` | Padding mode. | `'zeros'` |
| `device` | `torch.device` | Device to store the layer parameters. | `None` |
| `dtype` | `torch.dtype` | Data type to store the layer parameters. | `None` |
| `niter` | `int` | Number of iterations used by the AOL reparametrizer. | `1` |
References

[1] Prach, B., & Lampert, C. H. (2022). "Almost-orthogonal layers for efficient general-purpose Lipschitz networks." ECCV. https://arxiv.org/abs/2208.03160
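For illustration, here is a minimal usage sketch. The import path is inferred from the source file shown below and may differ in your install; the output shape follows the standard transposed-convolution arithmetic.

```python
import torch
from orthogonium.layers.conv.AOL.aol import AOLConvTranspose2D  # import path inferred from the source file

# A 2x upsampling transposed convolution whose Lipschitz constant is bounded by 1.
layer = AOLConvTranspose2D(
    in_channels=64,
    out_channels=32,
    kernel_size=3,
    stride=2,
    padding=1,
    output_padding=1,
    niter=4,  # number of iterations used by the AOL reparametrizer
)

x = torch.randn(8, 64, 16, 16)
y = layer(x)
print(y.shape)  # torch.Size([8, 32, 32, 32])
```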

Source code in orthogonium\layers\conv\AOL\aol.py
def __init__(
    self,
    in_channels,
    out_channels,
    kernel_size,
    stride=1,
    padding=0,
    output_padding=0,
    groups=1,
    bias=True,
    dilation=1,
    padding_mode="zeros",
    device=None,
    dtype=None,
    niter=1,
):
    """
    Almost-Orthogonal transposed convolution layer. This layer implements the method proposed in [1] to
    enforce almost-orthogonality. While orthogonality is not strictly enforced, the Lipschitz constant of the
    layer is guaranteed to be at most 1.

    Args:
        in_channels (int): Number of input channels.
        out_channels (int): Number of output channels.
        kernel_size (int or tuple): Size of the convolution kernel.
        stride (int or tuple, optional): Stride of the convolution. Default is 1.
        padding (int or tuple, optional): Padding size. Default is 0.
        output_padding (int or tuple, optional): Additional size added to the output shape. Default is 0.
        groups (int, optional): Number of groups. Default is 1.
        bias (bool, optional): Whether to include a learnable bias. Default is True.
        dilation (int or tuple, optional): Dilation rate. Default is 1.
        padding_mode (str, optional): Padding mode. Default is "zeros".
        device (torch.device, optional): Device to store the layer parameters. Default is None.
        dtype (torch.dtype, optional): Data type to store the layer parameters. Default is None.
        niter (int, optional): Number of iterations used by the AOL reparametrizer. Default is 1.


    References:
        [1] Prach, B., & Lampert, C. H. (2022).
               "Almost-orthogonal layers for efficient general-purpose Lipschitz networks."
               ECCV. https://arxiv.org/abs/2208.03160
    """
    super().__init__(
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        output_padding=output_padding,
        groups=groups,
        bias=bias,
        dilation=dilation,
        padding_mode=padding_mode,
        device=device,
        dtype=dtype,
    )
    self.niter = niter

    # Register the MultiStepAOLReparametrizer as a parametrization of the weight
    parametrize.register_parametrization(
        self,
        "weight",
        MultiStepAOLReparametrizer(
            min(out_channels, in_channels), groups=groups, niter=niter
        ),
    )

MultiStepAOLReparametrizer

Bases: Module

Source code in orthogonium\layers\conv\AOL\aol.py
class MultiStepAOLReparametrizer(nn.Module):
    def __init__(self, nb_features, groups, niter=4):
        super(MultiStepAOLReparametrizer, self).__init__()
        self.groups = groups
        self.nb_features = nb_features
        self.niter = niter
        self.q = nn.Parameter(torch.ones(nb_features, 1, 1, 1))

    def forward(self, kernel):
        co, cig, ks, ks2 = kernel.shape
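        # Work on the transposed kernel when it has at least as many outputs as inputs (per group).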
        if co // self.groups >= cig:
            kernel = transpose_kernel(kernel, self.groups, flip=True)
        kkt = kernel
        log_curr_norm = 0
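        # Repeatedly convolve the kernel with its transpose: each pass squares the Gram kernel,
        # with the running norm accumulated in log space for numerical stability. The absolute
        # sums of the result then yield the per-channel AOL rescaling factor applied below.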
        for i in range(self.niter):
            kkt_norm = kkt.norm().detach()
            kkt = kkt / kkt_norm
            log_curr_norm = 2 * (log_curr_norm + kkt_norm.log())
            kkt = fast_matrix_conv(
                transpose_kernel(kkt, self.groups, flip=True), kkt, self.groups
            )

        inverse_power = 2 ** (-self.niter)
        t = torch.abs(kkt)
        q = torch.exp(self.q)
        q_inv = torch.exp(-self.q)
        t = q_inv * t * q
        t = t.sum((1, 2, 3)).pow(inverse_power)
        norm = torch.exp(log_curr_norm * inverse_power)
        t = t * norm
        t = t.reshape(-1, 1, 1, 1)
        kernel = kernel / t
        if co // self.groups >= cig:
            kernel = transpose_kernel(kernel, self.groups, flip=True)
        return kernel

    def right_inverse(self, kernel):
        return kernel

    def reset_parameters(self):
        """
        Resets the parameters of the reparametrizer.
        """
        # Reset the q parameter to its initial value
        self.q.data.fill_(1.0)

reset_parameters()

Resets the parameters of the reparametrizer.

Source code in orthogonium\layers\conv\AOL\aol.py
def reset_parameters(self):
    """
    Resets the parameters of the reparametrizer.
    """
    # Reset the q parameter to its initial value
    self.q.data.fill_(1.0)
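As a sketch of how this reparametrizer is wired up (mirroring the `register_parametrization` call in `AOLConvTranspose2D` above; the import path is inferred from the source file, and the plain `nn.Conv2d` is used purely for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

from orthogonium.layers.conv.AOL.aol import MultiStepAOLReparametrizer  # import path inferred

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)

# Rescale the weight so that the convolution is (almost) 1-Lipschitz.
parametrize.register_parametrization(
    conv,
    "weight",
    MultiStepAOLReparametrizer(min(32, 16), groups=1, niter=4),
)

x = torch.randn(4, 16, 8, 8)
y = conv(x)  # the forward pass uses the rescaled weight
```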

AdaptiveSOCConv2d(in_channels, out_channels, kernel_size, stride=1, padding='same', dilation=1, groups=1, bias=True, padding_mode='circular', ortho_params=OrthoParams())

Factory function to create an orthogonal convolutional layer, selecting the appropriate class based on kernel size and stride. This is a modified implementation of the Skew Orthogonal Convolution [1], with significant modifications from the original paper:

  • This implementation provides an explicit kernel (larger than the original kernel size), so the forward pass is done in a single iteration, as described in [2].
  • It avoids channel padding to handle the case where cin != cout. Similarly, stride is handled natively using the adaptive scheme.
  • The Fantastic Four method is replaced by AOL, which reduces the number of iterations required to converge.

It aims to be more scalable to large networks and large image sizes, while enforcing orthogonality in the convolutional layers. This layer also intends to be compatible with all the features of the nn.Conv2d class (e.g., striding, dilation, grouping, etc.). This method has an explicit kernel, which means that the forward operation is equivalent to a standard convolutional layer, but the weights are constrained to be orthogonal.

Note
  • This implementation changes the size of the kernel, which also changes the padding semantics. Please adjust the padding according to the kernel size and the number of iterations.
  • Current unit testing uses a tolerance of 8e-2, so this layer can be expected to be 1.08-Lipschitz. Similarly, the stable rank is evaluated loosely (it must be greater than 0.5).

Key Features:

- Enforces orthogonality, preserving gradient norms.
- Supports native striding, dilation, grouped convolutions, and flexible padding.

Behavior:

- When kernel_size == stride, the layer is an `RKOConv2d`.
- When stride == 1, the layer is a `FastSOC`.
- Otherwise, the layer is a `SOCRkoConv2d`.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `in_channels` | `int` | Number of input channels. | required |
| `out_channels` | `int` | Number of output channels. | required |
| `kernel_size` | `_size_2_t` | Size of the convolution kernel. | required |
| `stride` | `_size_2_t` | Stride of the convolution. | `1` |
| `padding` | `str` or `_size_2_t` | Padding mode or size. | `'same'` |
| `dilation` | `_size_2_t` | Dilation rate. | `1` |
| `groups` | `int` | Number of blocked connections from input to output channels. | `1` |
| `bias` | `bool` | Whether to include a learnable bias. | `True` |
| `padding_mode` | `str` | Padding mode. | `'circular'` |
| `ortho_params` | `OrthoParams` | Parameters to control orthogonality. | `OrthoParams()` |

Returns:

| Type | Description |
| --- | --- |
| `Conv2d` | A configured instance of `nn.Conv2d` (one of `RKOConv2d`, `FastSOC`, or `SOCRkoConv2d`). |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `kernel_size < stride`, as orthogonality cannot be enforced. |

References
  • [1] Singla, S., & Feizi, S. (2021, July). Skew orthogonal convolutions. In International Conference on Machine Learning (pp. 9756-9766). PMLR. https://arxiv.org/abs/2105.11417
  • [2] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025). An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures. https://arxiv.org/abs/2501.07930
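A minimal usage sketch, assuming the package is importable as shown (the import path is inferred from the source file below) and keeping in mind the padding caveat from the note above:

```python
import torch
from orthogonium.layers.conv.adaptiveSOC.ortho_conv import AdaptiveSOCConv2d  # import path inferred

# stride == 1 -> FastSOC branch (explicit, enlarged kernel; circular "same" padding by default).
conv = AdaptiveSOCConv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding="same")

# kernel_size == stride -> RKOConv2d branch (patch-embedding style downsampling).
down = AdaptiveSOCConv2d(in_channels=32, out_channels=64, kernel_size=2, stride=2, padding=0)

x = torch.randn(4, 16, 32, 32)
y = conv(x)   # spatial size expected to be preserved by the "same" circular padding
z = down(y)   # spatial size divided by the stride
```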
Source code in orthogonium\layers\conv\adaptiveSOC\ortho_conv.py
def AdaptiveSOCConv2d(
    in_channels: int,
    out_channels: int,
    kernel_size: _size_2_t,
    stride: _size_2_t = 1,
    padding: Union[str, _size_2_t] = "same",
    dilation: _size_2_t = 1,
    groups: int = 1,
    bias: bool = True,
    padding_mode: str = "circular",
    ortho_params: OrthoParams = OrthoParams(),
) -> nn.Conv2d:
    """
    Factory function to create an orthogonal convolutional layer, selecting the appropriate class based on kernel
    size and stride. This is a modified implementation of the `Skew orthogonal convolution` [1], with significant
    modifications from the original paper:

    - This implementation provides an explicit kernel (larger than the original kernel size), so the forward pass is
        done in a single iteration, as described in [2].
    - It avoids channel padding to handle the case where cin != cout. Similarly, stride is handled natively using
        the adaptive scheme.
    - The Fantastic Four method is replaced by AOL, which reduces the number of iterations required to converge.

    It aims to be more scalable to large networks and large image sizes, while enforcing orthogonality in the
    convolutional layers. This layer also intends to be compatible with all the features of the `nn.Conv2d` class
    (e.g., striding, dilation, grouping, etc.). This method has an explicit kernel, which means that the forward
    operation is equivalent to a standard convolutional layer, but the weights are constrained to be orthogonal.

    Note:
        - This implementation changes the size of the kernel, which also changes the padding semantics. Please adjust
            the padding according to the kernel size and the number of iterations.
        - Current unit testing uses a tolerance of 8e-2, so this layer can be expected to be 1.08-Lipschitz.
            Similarly, the stable rank is evaluated loosely (it must be greater than 0.5).

    Key Features:
    -------------
        - Enforces orthogonality, preserving gradient norms.
        - Supports native striding, dilation, grouped convolutions, and flexible padding.

    Behavior:
    -------------
        - When kernel_size == stride, the layer is an `RKOConv2d`.
        - When stride == 1, the layer is a `FastSOC`.
        - Otherwise, the layer is a `SOCRkoConv2d`.

    Arguments:
        in_channels (int): Number of input channels.
        out_channels (int): Number of output channels.
        kernel_size (_size_2_t): Size of the convolution kernel.
        stride (_size_2_t, optional): Stride of the convolution. Default is 1.
        padding (str or _size_2_t, optional): Padding mode or size. Default is "same".
        dilation (_size_2_t, optional): Dilation rate. Default is 1.
        groups (int, optional): Number of blocked connections from input to output channels. Default is 1.
        bias (bool, optional): Whether to include a learnable bias. Default is True.
        padding_mode (str, optional): Padding mode. Default is "circular".
        ortho_params (OrthoParams, optional): Parameters to control orthogonality. Default is `OrthoParams()`.

    Returns:
        A configured instance of `nn.Conv2d` (one of `RKOConv2d`, `FastSOC`, or `SOCRkoConv2d`).

    Raises:
        `ValueError`: If kernel_size < stride, as orthogonality cannot be enforced.


    References:
        - [1] Singla, S., & Feizi, S. (2021, July). Skew orthogonal convolutions. In International Conference
        on Machine Learning (pp. 9756-9766). PMLR. <https://arxiv.org/abs/2105.11417>
        - [2] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025).
        An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures.
        <https://arxiv.org/abs/2501.07930>
    """
    if kernel_size < stride:
        raise ValueError(
            "kernel size must be smaller than stride. The set of orthonal convolutions is empty in this setting."
        )
    if kernel_size == stride:
        convclass = RKOConv2d
    elif stride == 1:
        convclass = FastSOC
    else:
        convclass = SOCRkoConv2d
    return convclass(
        in_channels,
        out_channels,
        kernel_size,
        stride,
        padding,
        dilation,
        groups,
        bias,
        padding_mode,
        # ortho_params=ortho_params,
    )

AdaptiveSOCConvTranspose2d(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True, dilation=1, padding_mode='zeros', ortho_params=OrthoParams())

Factory function to create an orthogonal transposed convolutional layer, selecting the appropriate class based on kernel size and stride. This is a modified implementation of the Skew Orthogonal Convolution [1], with significant modifications from the original paper:

  • This implementation provides an explicit kernel (larger than the original kernel size), so the forward pass is done in a single iteration, as described in [2].
  • It avoids channel padding to handle the case where cin != cout. Similarly, stride is handled natively using the adaptive scheme.
  • The Fantastic Four method is replaced by AOL, which reduces the number of iterations required to converge.

It aims to be more scalable to large networks and large image sizes, while enforcing orthogonality in the convolutional layers. This layer also intends to be compatible with all the features of the nn.ConvTranspose2d class (e.g., striding, dilation, grouping, etc.). This method has an explicit kernel, which means that the forward operation is equivalent to a standard transposed convolutional layer, but the weights are constrained to be orthogonal.

Note
  • This implementation changes the size of the kernel, which also changes the padding semantics. Please adjust the padding according to the kernel size and the number of iterations.
  • Current unit testing uses a tolerance of 8e-2, so this layer can be expected to be 1.08-Lipschitz. Similarly, the stable rank is evaluated loosely (it must be greater than 0.5).

Key Features:

- Enforces orthogonality, preserving gradient norms.
- Supports native striding, dilation, grouped convolutions, and flexible padding.

Behavior:

- When kernel_size == stride, the layer is an `RkoConvTranspose2d`.
- When stride == 1, the layer is a `SOCTranspose`.
- Otherwise, the layer is a `SOCRkoConvTranspose2d`.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `in_channels` | `int` | Number of input channels. | required |
| `out_channels` | `int` | Number of output channels. | required |
| `kernel_size` | `_size_2_t` | Size of the convolution kernel. | required |
| `stride` | `_size_2_t` | Stride of the convolution. | `1` |
| `padding` | `_size_2_t` | Padding size. | `0` |
| `output_padding` | `_size_2_t` | Additional size added to the output shape. | `0` |
| `dilation` | `_size_2_t` | Dilation rate. | `1` |
| `groups` | `int` | Number of blocked connections from input to output channels. | `1` |
| `bias` | `bool` | Whether to include a learnable bias. | `True` |
| `padding_mode` | `str` | Padding mode. | `'zeros'` |
| `ortho_params` | `OrthoParams` | Parameters to control orthogonality. | `OrthoParams()` |

Returns:

| Type | Description |
| --- | --- |
| `ConvTranspose2d` | A configured instance of `nn.ConvTranspose2d` (one of `RkoConvTranspose2d`, `SOCTranspose`, or `SOCRkoConvTranspose2d`). |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `kernel_size < stride`, as orthogonality cannot be enforced. |

References
  • [1] Singla, S., & Feizi, S. (2021, July). Skew orthogonal convolutions. In International Conference on Machine Learning (pp. 9756-9766). PMLR. https://arxiv.org/abs/2105.11417
  • [2] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025). An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures. https://arxiv.org/abs/2501.07930
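A minimal usage sketch (the import path is inferred from the source file below; the output size follows the usual transposed-convolution arithmetic for this configuration):

```python
import torch
from orthogonium.layers.conv.adaptiveSOC.ortho_conv import AdaptiveSOCConvTranspose2d  # import path inferred

# kernel_size == stride -> RkoConvTranspose2d branch, a common choice for 2x upsampling.
up = AdaptiveSOCConvTranspose2d(in_channels=64, out_channels=32, kernel_size=2, stride=2)

x = torch.randn(8, 64, 16, 16)
y = up(x)
print(y.shape)  # torch.Size([8, 32, 32, 32])
```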
Source code in orthogonium\layers\conv\adaptiveSOC\ortho_conv.py
def AdaptiveSOCConvTranspose2d(
    in_channels: int,
    out_channels: int,
    kernel_size: _size_2_t,
    stride: _size_2_t = 1,
    padding: _size_2_t = 0,
    output_padding: _size_2_t = 0,
    groups: int = 1,
    bias: bool = True,
    dilation: _size_2_t = 1,
    padding_mode: str = "zeros",
    ortho_params: OrthoParams = OrthoParams(),
) -> nn.ConvTranspose2d:
    """
    Factory function to create an orthogonal transposed convolutional layer, selecting the appropriate class based on
    kernel size and stride. This is a modified implementation of the `Skew orthogonal convolution` [1], with significant
    modifications from the original paper:

    - This implementation provides an explicit kernel (larger than the original kernel size), so the forward pass is
        done in a single iteration, as described in [2].
    - It avoids channel padding to handle the case where cin != cout. Similarly, stride is handled natively using
        the adaptive scheme.
    - The Fantastic Four method is replaced by AOL, which reduces the number of iterations required to converge.

    It aims to be more scalable to large networks and large image sizes, while enforcing orthogonality in the
    convolutional layers. This layer also intends to be compatible with all the features of the `nn.ConvTranspose2d`
    class (e.g., striding, dilation, grouping, etc.). This method has an explicit kernel, which means that the forward
    operation is equivalent to a standard transposed convolutional layer, but the weights are constrained to be orthogonal.

    Note:
        - This implementation changes the size of the kernel, which also changes the padding semantics. Please adjust
            the padding according to the kernel size and the number of iterations.
        - Current unit testing uses a tolerance of 8e-2, so this layer can be expected to be 1.08-Lipschitz.
            Similarly, the stable rank is evaluated loosely (it must be greater than 0.5).

    Key Features:
    -------------
        - Enforces orthogonality, preserving gradient norms.
        - Supports native striding, dilation, grouped convolutions, and flexible padding.

    Behavior:
    -------------
        - When kernel_size == stride, the layer is an `RkoConvTranspose2d`.
        - When stride == 1, the layer is a `SOCTranspose`.
        - Otherwise, the layer is a `SOCRkoConvTranspose2d`.

    Arguments:
        in_channels (int): Number of input channels.
        out_channels (int): Number of output channels.
        kernel_size (_size_2_t): Size of the convolution kernel.
        stride (_size_2_t, optional): Stride of the convolution. Default is 1.
        padding (_size_2_t, optional): Padding size. Default is 0.
        output_padding (_size_2_t, optional): Additional size added to the output shape. Default is 0.
        groups (int, optional): Number of blocked connections from input to output channels. Default is 1.
        bias (bool, optional): Whether to include a learnable bias. Default is True.
        dilation (_size_2_t, optional): Dilation rate. Default is 1.
        padding_mode (str, optional): Padding mode. Default is "zeros".
        ortho_params (OrthoParams, optional): Parameters to control orthogonality. Default is `OrthoParams()`.

    Returns:
        A configured instance of `nn.ConvTranspose2d` (one of `RkoConvTranspose2d`, `SOCTranspose`, or `SOCRkoConvTranspose2d`).

    Raises:
        `ValueError`: If kernel_size < stride, as orthogonality cannot be enforced.


    References:
        - [1] Singla, S., & Feizi, S. (2021, July). Skew orthogonal convolutions. In International Conference
        on Machine Learning (pp. 9756-9766). PMLR. <https://arxiv.org/abs/2105.11417>
        - [2] Boissin, T., Mamalet, F., Fel, T., Picard, A. M., Massena, T., & Serrurier, M. (2025).
        An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures.
        <https://arxiv.org/abs/2501.07930>
    """
    if kernel_size < stride:
        raise ValueError(
            "kernel size must be smaller than stride. The set of orthonal convolutions is empty in this setting."
        )
    if kernel_size == stride:
        convclass = RkoConvTranspose2d
    elif stride == 1:
        convclass = SOCTranspose
    else:
        convclass = SOCRkoConvTranspose2d
    return convclass(
        in_channels,
        out_channels,
        kernel_size,
        stride,
        padding,
        output_padding,
        groups,
        bias,
        dilation,
        padding_mode,
        # ortho_params=ortho_params,
    )