Preface

This post records a PyTorch implementation of the workflow described in my earlier post 【数值分析×机器学习】以SVD的分解形式进行深度神经网络的训练, together with the problems encountered along the way and how they were resolved. The rough idea is to factorize the weight matrices of the network layers (mainly fully connected layers and convolution layers) into a low-rank form, in order to reduce model complexity and speed up the convergence of the optimization. Readers who are interested can look at the paper linked below; I found it very rewarding:

  • Title: Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification
  • Paper download link: arxiv@1906.06925

Concretely, SVD here means that a weight matrix $W\in\mathbb{R}^{m\times n}$ can be decomposed into three parts $U\in\mathbb{R}^{m\times r}$, $V\in\mathbb{R}^{n\times r}$ and $s\in\mathbb{R}^{r}$, where $U$ and $V$ are orthogonal matrices (the singular-vector matrices). In the full-rank case, i.e. $r=\min(m,n)$, $W$ can be reconstructed exactly as $W=U\,\text{diag}(s)\,V^\top$. The problem is that this decomposition is very expensive: performing an SVD of every weight matrix $W$ of a deep neural network at every training iteration would clearly cost far more than what is gained from the simpler model and the faster optimization. One idea is therefore to add regularization terms to the loss function that push the singular-vector matrices $U,V$ towards orthogonal matrices and make $s$ as sparse as possible. From the paper, the loss function has the following form:
$$L(U,s,V)=L_T\left(\text{diag}\left(\sqrt{|s|}\right)V^\top,\,U\,\text{diag}\left(\sqrt{|s|}\right)\right)+\lambda_o\sum_{l=1}^DL_o(U_l,V_l)+\lambda_s\sum_{l=1}^DL_s(s_l)\tag{1}$$
where:

  1. $L_T$ is the training loss on the decomposed network layers;

  2. $L_o$ is the orthogonality regularizer:
    $$L_o(U,V)=\frac{1}{r^2}\left(\left\|U^\top U-I\right\|_F^2+\left\|V^\top V-I\right\|_F^2\right)\tag{2}$$
    whose purpose is to push the singular-vector matrices $U,V$ towards orthogonal matrices;

  3. $U_l,V_l,s_l$ are the singular-vector matrices and the singular-value vector of network layer $l$, and $D$ is the total number of layers;

  4. $L_s$ is the sparsity-inducing regularization loss; the paper compares the performance of $L_s=L^H$ against $L_s=L^1$:
    $$L^H(s)=\frac{\|s\|_1}{\|s\|_2}=\frac{\sum_i|s_i|}{\sqrt{\sum_{i}s_i^2}},\qquad L^1(s)=\|s\|_1\tag{3}$$
    whose purpose is to make the singular-value vector as sparse as possible;

  5. $\lambda_o$ and $\lambda_s$ are decay parameters: $\lambda_o$ can be set to a fairly large positive number to enforce orthogonality, while $\lambda_s$ trades accuracy against FLOPs to obtain a low-rank model. The concrete hyperparameter settings are given in the appendix of the paper and are not repeated here; a short PyTorch sketch of the two regularizers above follows right after this list.
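
The sketch below is a minimal stand-alone rendering of Eq. $(2)$ and of the Hoyer regularizer $L^H$ of Eq. $(3)$, assuming $U$, $V$ and $s$ are plain tensors; it is only meant to make the formulas concrete and is not taken from the paper's (unreleased) code.

import torch

def orthogonality_regularizer(U, V):
    # Eq. (2): L_o(U, V) = (||U^T U - I||_F^2 + ||V^T V - I||_F^2) / r^2
    r = U.shape[1]
    eye = torch.eye(r)
    return (torch.norm(U.t() @ U - eye, p='fro') ** 2
            + torch.norm(V.t() @ V - eye, p='fro') ** 2) / r ** 2

def hoyer_regularizer(s):
    # Eq. (3): L^H(s) = ||s||_1 / ||s||_2 (the L^1 variant is simply torch.norm(s, 1))
    return torch.norm(s, 1) / torch.norm(s, 2)

# Toy check: exactly orthogonal U and V give a (numerically) zero orthogonality penalty
U, _ = torch.qr(torch.rand(8, 4))   # Q has orthonormal columns
V, _ = torch.qr(torch.rand(6, 4))
s = torch.rand(4)
print(orthogonality_regularizer(U, V))   # close to 0
print(hoyer_regularizer(s))              # between 1 and sqrt(len(s))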

The authors of the paper run SVD-training experiments with various ResNets on the CIFAR10 dataset (and also on ImageNet), but the project code has not been released. This post starts from a simple example, tries to implement the workflow, and records the problems encountered along the way.

This post also involves digging into the PyTorch source code; personally I feel its content should be a fairly useful reference for PyTorch beginners who are getting started with building neural network models.

Prescript: the implementation may still contain problems; further corrections may be made from time to time.



1 Quick Start: Starting from the SVD of a Fully Connected Layer

First, let us analyse the problems that need to be solved:

  1. How do we rewrite the network layers for SVD training? Concretely, how do we rewrite the fully connected layer (torch.nn.Linear) and the two-dimensional convolution layer (torch.nn.Conv2d) so that they take the SVD-training form?
  2. How do we rewrite the loss function? Concretely, how do we add Eq. $(2)$ and Eq. $(3)$ from the preface to the loss function, which in code boils down to how to pass the relevant model parameters into the loss computation?

This may still sound a bit abstract, so let us look at a very simple demo.

The code below is a very simple model consisting of two fully connected layers (self.linear and self.linear1), each followed by a ReLU activation:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import torch as th

class Net(th.nn.Module):
    
    def __init__(self, input_dim, hidden_dim):
        super(Net, self).__init__()
        self.linear = th.nn.Linear(input_dim, hidden_dim, bias=False)
        self.relu1 = th.nn.ReLU()
        self.linear1 = th.nn.Linear(hidden_dim, 1, bias=False)
        self.relu2 = th.nn.ReLU()

    def forward(self, x):
        x = self.linear(x)
        x = self.relu1(x)
        x = self.linear1(x)
        x = self.relu2(x)
        return x

Our goal is to rewrite the first fully connected layer of this model in SVD form.

Note that the weight matrix of torch.nn.Linear(in_features, out_features) has shape (out_features, in_features) (you can check this via self.linear.weight.shape), so the weight matrix of a fully connected layer admits the following SVD:
$$W=U\cdot S\cdot V^\top$$
where $W\in\mathbb{R}^{\text{out\_features}\times\text{in\_features}}$, $U\in\mathbb{R}^{\text{out\_features}\times\text{rank}}$, $S\in\mathbb{R}^{\text{rank}\times\text{rank}}$, $V\in\mathbb{R}^{\text{in\_features}\times\text{rank}}$. Assuming a full-rank decomposition, $\text{rank}=\min\{\text{in\_features},\text{out\_features}\}$.
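
As a quick sanity check (a minimal sketch, not part of the implementation below), torch.svd reproduces exactly this factorization for the weight of a Linear layer:

import torch

linear = torch.nn.Linear(10, 6, bias=False)                  # weight shape: (out_features, in_features) = (6, 10)
U, s, V = torch.svd(linear.weight)                           # U: (6, 6), s: (6, ), V: (10, 6), rank = min(6, 10) = 6
W_rebuilt = U @ torch.diag(s) @ V.t()                        # W = U * diag(s) * V^T
print(torch.allclose(linear.weight, W_rebuilt, atol=1e-5))   # expected: True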

Clearly $U$ and $V^\top$ can each be viewed as a fully connected layer without bias (bias=False), and the diagonal matrix $S$ built from the singular-value vector is essentially an element-wise product, which can be represented by the custom DotProduct layer below. We thus obtain the SVD form of the model above (the self.linear layer in the previous code is replaced by the three factor layers self.orthogonal_linear1, self.diag_dotproduct1 and self.orthogonal_linear2):

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import torch as th

class DotProduct(th.nn.Module):
    
    def __init__(self, in_features, bias=True):
        super(DotProduct, self).__init__()
        self.bias_flag = bias
        self.weight = th.nn.Parameter(th.rand(in_features, ))
        if bias:
            self.bias = th.nn.Parameter(th.rand(in_features, ))

    def forward(self, x):
        x = self.weight * x
        if self.bias_flag:
            x = x + self.bias
        return x

class SVDNet(th.nn.Module):

    def __init__(self, input_dim, hidden_dim):
        super(SVDNet, self).__init__()
        rank = min(hidden_dim, input_dim)
        self.orthogonal_linear1 = th.nn.Linear(input_dim, rank, bias=False)
        self.diag_dotproduct1 = DotProduct(rank, bias=False)
        self.orthogonal_linear2 = th.nn.Linear(rank, hidden_dim, bias=False)
        self.relu1 = th.nn.ReLU()
        self.linear1 = th.nn.Linear(hidden_dim, 1, bias=False)
        self.relu2 = th.nn.ReLU()

    def forward(self, x):
        x = self.orthogonal_linear1(x)
        x = self.diag_dotproduct1(x)
        x = self.orthogonal_linear2(x)
        x = self.relu1(x)
        x = self.linear1(x)
        x = self.relu2(x)
        return x

Next we need to define the loss function. A few points to note:

  • The model itself is passed as an argument to the forward function, and all of its parameters are then retrieved through the generator model.named_parameters().
  • Not every model parameter needs to be regularized (for example, we do not want to regularize self.linear1.weight), so by convention parameters whose names start with orthogonal_ receive the orthogonality regularizer of Eq. $(2)$, and parameters whose names start with diag_ receive the sparsity-inducing regularizer of Eq. $(3)$. This looks a bit clumsy, but I have not found a better way; in fact the same naming strategy is used later when rewriting the network layers.
  • The loss value here consists of two parts: error (the mean squared error) and regularizer (the regularization terms).

The code is as follows:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import torch as th

class SVDLoss(th.nn.Module):

    def __init__(self):
        super(SVDLoss, self).__init__()

    def forward(self, y_pred, y_true, model, regularizer_weights=[1, 1]):
        regularizer = th.zeros(1, )
        for name, param in model.named_parameters():
            if name.startswith('orthogonal_'):
                regularizer += self.singular_vectors_orthogonality_regularizer(param) * regularizer_weights[0]
            if name.startswith('diag_'):
                regularizer += self.singular_values_sparsity_inducing_regularizer(param) * regularizer_weights[1]
        error = y_pred - y_true
        loss = th.mm(error.t(), error) / error.shape[0] + regularizer
        return loss
    
    def singular_vectors_orthogonality_regularizer(self, x):
        return th.norm(th.mm(x.t(), x) - th.eye(x.shape[1]), p='fro') / x.shape[1] / x.shape[1]
    
    def singular_values_sparsity_inducing_regularizer(self, x):
        # return th.norm(x, 1)
        return th.norm(x, 1) / th.norm(x, 2)
        

Finally, we randomly generate a small dataset and run a simple training demo:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import torch as th
from torch.utils.data import DataLoader, TensorDataset

n_samples = 65536
input_dim = 1024
hidden_dim = 128
n_epochs = 5

model = SVDNet(input_dim, hidden_dim)
loss_func = SVDLoss()
optimizer = th.optim.Adam(model.parameters(), lr=0.001)

X = th.rand(n_samples, input_dim)
Y = th.rand(n_samples, 1)

dataset = TensorDataset(X, Y)
loader = DataLoader(dataset, batch_size=32, shuffle=False)

for epoch in range(1, 1 + n_epochs):
    print('='*32, epoch, '='*32)
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        outputs = model(batch_x)
        print(outputs.shape, batch_y.shape)
        loss = loss_func(outputs, batch_y, model)
        print(loss)
        loss.backward()
        optimizer.step() 

The dataset above is randomly generated and the model is as trivial as it gets; it is only a demo with no practical meaning.

Hopefully you are getting the hang of it; next, let us look at something a little more interesting.


2 How to Rewrite the Network Layers for SVD Training?

At first I wondered whether I could write a function whose argument is a PyTorch model class (i.e. a module inheriting from torch.nn.Module) and whose return value is also a class, namely the argument transformed into its equivalent SVD-training form.

I then realized that this approach is not really feasible, because it is hard to rewrite the forward logic of the converted class inside a function, based only on the model class passed in.

Since that road is blocked, we accept a little extra manual work per model conversion and simply rewrite the converted class for each module. It then turns out that directly inheriting from the corresponding module class (torch.nn.Linear or torch.nn.Conv2d), rather than inheriting from torch.nn.Module as usual, greatly reduces the amount of code.

The paper only gives the conversion logic for the two-dimensional convolution layer, but since the conversion of the fully connected layer is very simple (as shown in the previous section), it is covered here as well.

2.1 Rewriting the Fully Connected Layer

Let us start with the simpler fully connected layer.

The source code of the fully connected layer can be found in linear.py under the directory E:\Anaconda3\Lib\site-packages\torch\nn\modules\:

class Linear(Module):
    
    __constants__ = ['in_features', 'out_features']
    in_features: int
    out_features: int
    weight: Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True) -> None:
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.Tensor(out_features, in_features))
        if bias:
            self.bias = Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self) -> None:
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight, self.bias)

    def extra_repr(self) -> str:
        return 'in_features={}, out_features={}, bias={}'.format(
            self.in_features, self.out_features, self.bias is not None
        )

By inheriting from the Linear class, the SVD-training form of the fully connected layer can be implemented very easily. We keep the constructor signature as close to the original as possible, so that converting a model later only requires replacing every Linear with LinearSVD (purely for convenience); there is also no need to make it as elaborate as the source code above:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import torch
from torch.nn import functional as F

class LinearSVD(torch.nn.Linear):
	"""线性层的SVD形式"""
	def __init__(self, in_features, out_features, bias=True):
		super(LinearSVD, self).__init__(in_features, out_features, bias)
		rank = min(in_features, out_features)
		self.svd_weight_matrix_u = torch.nn.Parameter(torch.Tensor(out_features, rank))
		self.svd_weight_matrix_v = torch.nn.Parameter(torch.Tensor(in_features, rank))
		self.svd_weight_vector_s = torch.nn.Parameter(torch.Tensor(rank, ))
		
	def forward(self, input):
		weight = torch.mm(torch.mm(self.svd_weight_matrix_u, torch.diag(self.svd_weight_vector_s)), self.svd_weight_matrix_v.t())
		return F.linear(input, weight, self.bias)

Very simple, so I will not belabor the details. Note that tensor.t() is shorthand for transposing a two-dimensional tensor, and that only by wrapping a tensor in torch.nn.Parameter is it guaranteed to show up in model.named_parameters() and to be optimized during training; otherwise it would just be a fixed tensor.
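
To make the torch.nn.Parameter point concrete, the small check below (just a sketch using the LinearSVD class above) lists the registered parameters; note that, because we inherit from torch.nn.Linear, the parent's original weight is still registered as a parameter even though forward no longer uses it:

layer = LinearSVD(in_features=8, out_features=4, bias=False)
for name, param in layer.named_parameters():
    print(name, tuple(param.shape))
# Expected listing (rank = min(8, 4) = 4):
#   weight (4, 8)                <- inherited from torch.nn.Linear, no longer used by forward
#   svd_weight_matrix_u (4, 4)
#   svd_weight_matrix_v (8, 4)
#   svd_weight_vector_s (4,)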

2.2 Rewriting the Two-Dimensional Convolution Layer

2.2.1 Recap: How a Two-Dimensional Convolution Layer Works

Since some readers (myself included) may have forgotten how a two-dimensional convolution layer works, here is a brief recap; for more detail there is plenty of material about convolution layers online.

  • The core of a convolution layer is the kernel, which can be seen as a window sliding over the input image from top to bottom and left to right. At each position it takes a small patch of the image (the same size as the kernel), multiplies the patch element-wise with the kernel, and sums the products to produce one scalar of the output image.
  • There can be more than one kernel; the number of kernels equals the number of output channels (out-channels). Since each kernel produces one output image, out-channels is also called the number of feature maps.

Here is a simple demo to further illustrate the computation performed by a convolution layer:

import torch

# Shape of the convolution weight: (out_channels, in_channels, kernel_size, kernel_size)
conv = torch.nn.Conv2d(in_channels=2, out_channels=4, kernel_size=5, bias=False)
print(conv.weight.shape)

# Output obtained by calling the convolution layer
x = torch.ones((1, 2, 5, 5))
print(conv(x))

# Recompute the output by hand from the convolution weights
for i in range(conv.weight.shape[0]):
    for j in range(conv.weight.shape[1]):
        print(sum(sum(torch.FloatTensor(conv.weight[i, j, :, :]) * torch.ones((5,5)))))

Running this prints:

torch.Size([4, 2, 5, 5])
tensor([[[[-0.1095]],

         [[-1.1097]],

         [[-0.0709]],

         [[-0.3359]]]], grad_fn=<MkldnnConvolutionBackward>)
tensor(-0.5442, grad_fn=<AddBackward0>)
tensor(0.4347, grad_fn=<AddBackward0>)
tensor(-0.5026, grad_fn=<AddBackward0>)
tensor(-0.6072, grad_fn=<AddBackward0>)
tensor(-0.3988, grad_fn=<AddBackward0>)
tensor(0.3279, grad_fn=<AddBackward0>)
tensor(-0.1677, grad_fn=<AddBackward0>)
tensor(-0.1682, grad_fn=<AddBackward0>)

  • From the shape of the convolution weight conv.weight, we can see that it is (out_channels, in_channels, kernel_size, kernel_size).
  • Comparing the output of the convolution layer with the values recomputed by hand from its weights, we can see that $-0.1095=-0.5442+0.4347$, and the rest follow by analogy: each kernel adds up all in_channels element-wise product sums to produce one output value.

Moreover, a fully connected layer is essentially a special convolution layer (one whose kernel is as large as the image), and a convolution layer can also be converted back into a fully connected layer. I wrote a simple implementation of this conversion (I originally thought it would be useful, but it turned out not to be, because the paper applies the SVD to the kernel directly and never needs the convolution layer's explicit weight matrix; still, having written it, and written it rather carefully, it would be a pity not to include it, and it should be correct). Essentially it maps every element of the fully connected weight matrix to its position in the kernel:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import torch
from torch import nn

# Convert a given Conv2d layer into the weight matrix of an equivalent linear layer
def conv2d_to_linear(conv2d, input_height, input_width):
	# Output size of a convolution: $O=\frac{W-K+2P}{S}+1$
	def _clac_output_size(_input_size, _kernel_size, _stride_size, _padding_size=0):
		_output_size = (_input_size - _kernel_size + _padding_size * 2) / _stride_size + 1
		return int(_output_size)						# padding is usually chosen so that the output keeps the input size
	
	# Convert a 3D index into a flat 1D index
	def _index3d_to_index1d(_channel, _height, _width, _height_dim, _width_dim):
		return _channel * _height_dim * _width_dim + _height * _width_dim + _width
	
	# Convert a flat 1D index back into a 3D index
	def _index1d_to_index3d(_index, _height_dim, _width_dim):
		_size = _height_dim * _width_dim
		_channel = _index // _size
		_height = (_index - _channel * _size) // _width_dim
		_width = _index - _channel * _size - _height * _width_dim
		return _channel, _height, _width

	conv_weight = conv2d.weight																		# weight tensor of the convolution layer
	out_channels, in_channels, kernel_height, kernel_width = conv_weight.shape						# weight shape: [out channels, in channels, kernel height, kernel width]
	stride_height, stride_width = conv2d.stride														# kernel strides along the height and width directions
	padding_height, padding_width = conv2d.padding													# padding along the height and width directions
	output_height = _clac_output_size(input_height, kernel_height, stride_height, padding_height)	# output height per channel
	output_width = _clac_output_size(input_width, kernel_width, stride_width, padding_width)		# output width per channel
	input_dim = in_channels * input_height * input_width											# total input dimension (input tensor flattened to a vector)
	output_dim = out_channels * output_height * output_width										# total output dimension (output tensor flattened to a vector)
	linear_weight = torch.zeros((output_dim, input_dim))											# initialize the linear weight matrix with zeros
	for output_index in range(output_dim):															# iterate over every output dimension (1D index)
		_output_channel, _output_height, _output_width = _index1d_to_index3d(output_index, 
																			 output_height, 
																			 output_width)			# convert the 1D index back into a 3D index
		
		# Because of the stride, the starting input coordinate is not necessarily equal to the output coordinate
		start_height = _output_height * stride_height												# starting input height coordinate for _output_height
		start_width = _output_width * stride_width													# starting input width coordinate for _output_width
		for _input_height in range(start_height, start_height + kernel_height):						# iterate over the input height coordinates covered by the kernel
			for _input_width in range(start_width, start_width + kernel_width):						# iterate over the input width coordinates covered by the kernel
				for in_channel in range(in_channels):												# iterate over all input channels
					input_index = _index3d_to_index1d(in_channel, 
													  _input_height, 
													  _input_width, 
													  input_height, 
													  input_width)									# flat index of the corresponding input dimension
					linear_weight[output_index, input_index] = conv_weight[_output_channel, 
																			in_channel, 
																		   _input_height - start_height, 
																		   _input_width - start_width]	# fill in the linear weight matrix
	return linear_weight


At this point the workings of a convolution layer should be more or less clear.
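
Before moving on, here is a quick sanity check of conv2d_to_linear above (a sketch under the assumptions padding=0, dilation=1, groups=1 and bias=False, since the index mapping does not account for padding): the flattened output of the convolution should equal a matrix-vector product with the converted weight.

import torch

height, width = 8, 8
conv = torch.nn.Conv2d(in_channels=2, out_channels=4, kernel_size=3, stride=2, bias=False)
x = torch.rand(1, 2, height, width)

with torch.no_grad():
    linear_weight = conv2d_to_linear(conv, height, width)    # shape: (output_dim, input_dim)
    y_conv = conv(x).flatten()                               # (1, 4, 3, 3) flattened to (36, )
    y_linear = torch.mv(linear_weight, x.flatten())          # the same computation as a matrix-vector product
    print(torch.allclose(y_conv, y_linear, atol=1e-5))       # expected: True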

2.2.2 Implementing the Rewritten Two-Dimensional Convolution Layer

Section 3.1 of 【数值分析×机器学习】以SVD的分解形式进行深度神经网络的训练 describes how to rewrite the convolution layer; it is copied here directly:

  • For a convolution layer, the kernel $\mathcal{K}\in\mathbb{R}^{n\times c\times w\times h}$ can be represented as a four-dimensional tensor, where $n,c,w,h$ are respectively the number of filters, the number of input channels, and the filter width and height. Only spatial-wise (reference $[15]$) or channel-wise (reference $[39]$) decompositions are used to factorize the convolution layer (because they have worked well in previous work). The channel-wise decomposition goes as follows:

    1. First reshape $\mathcal{K}$ into a two-dimensional matrix $\hat K\in\mathbb{R}^{n\times cwh}$;

    2. Then apply the SVD to $\hat K$ to obtain $U\in\mathbb{R}^{n\times r},V\in\mathbb{R}^{cwh\times r},s\in\mathbb{R}^r$, where $U$ and $V$ are orthogonal matrices and $r=\min(n,cwh)$;

    3. The original convolution layer is thereby decomposed into two consecutive sub-convolution layers: $\mathcal{K}_1\in\mathbb{R}^{r\times c\times w\times h}$ (reshaped back from $\text{diag}(\sqrt{s})V^\top$) and $\mathcal{K}_2\in\mathbb{R}^{n\times r\times1\times1}$ (reshaped back from $U\,\text{diag}(\sqrt{s})$);

  • The spatial-wise decomposition proceeds in the same way as the channel-wise decomposition above:

    1. First reshape $\mathcal{K}$ into a two-dimensional matrix $\hat K\in\mathbb{R}^{nw\times ch}$;

    2. This yields the decomposed factors $\mathcal{K}_1\in\mathbb{R}^{nw\times r}$ and $\mathcal{K}_2\in\mathbb{R}^{n\times r\times w\times1}$;

    Such a full-rank decomposition usually achieves accuracy comparable to the original model.

Note that the number of filters above is simply out_channels.
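
To convince ourselves that the channel-wise decomposition is exact at full rank, here is a small sketch (using PyTorch's (out_channels, in_channels, kernel_height, kernel_width) weight layout rather than the paper's $n\times c\times w\times h$ ordering) that splits a random kernel into the two sub-convolutions and compares the outputs:

import torch
import torch.nn.functional as F

n, c, kh, kw = 8, 3, 3, 3                                          # filters, input channels, kernel height/width
kernel = torch.rand(n, c, kh, kw)
x = torch.rand(1, c, 16, 16)

kernel_hat = kernel.reshape(n, -1)                                 # \hat K: (n, c*kh*kw)
U, s, V = torch.svd(kernel_hat)                                    # U: (n, r), s: (r, ), V: (c*kh*kw, r), r = min(n, c*kh*kw)
kernel_1 = (torch.diag(s.sqrt()) @ V.t()).reshape(-1, c, kh, kw)   # K_1: (r, c, kh, kw)
kernel_2 = (U @ torch.diag(s.sqrt())).reshape(n, -1, 1, 1)         # K_2: (n, r, 1, 1), a 1x1 convolution

y_original = F.conv2d(x, kernel)
y_decomposed = F.conv2d(F.conv2d(x, kernel_1), kernel_2)           # K_1 convolution followed by the 1x1 convolution K_2
print(torch.allclose(y_original, y_decomposed, atol=1e-4))         # expected: True (full rank, exact up to rounding)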

Similarly, we can refer to the source code in conv.py under E:\Anaconda3\Lib\site-packages\torch\nn\modules, and stick as closely as possible to inheriting and mimicking it, to reduce the cost of converting models later:

class Conv2d(_ConvNd):
    
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: _size_2_t,
        stride: _size_2_t = 1,
        padding: _size_2_t = 0,
        dilation: _size_2_t = 1,
        groups: int = 1,
        bias: bool = True,
        padding_mode: str = 'zeros'  # TODO: refine this type
    ):
        kernel_size = _pair(kernel_size)
        stride = _pair(stride)
        padding = _pair(padding)
        dilation = _pair(dilation)
        super(Conv2d, self).__init__(
            in_channels, out_channels, kernel_size, stride, padding, dilation,
            False, _pair(0), groups, bias, padding_mode)

    def _conv_forward(self, input, weight):
        if self.padding_mode != 'zeros':
            return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
                            weight, self.bias, self.stride,
                            _pair(0), self.dilation, self.groups)
        return F.conv2d(input, weight, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

    def forward(self, input: Tensor) -> Tensor:
        return self._conv_forward(input, self.weight)

This is slightly more involved than the fully connected layer, but not by much; here is how I rewrote it:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import torch
from torch.nn import functional as F

class Conv2dSVD(torch.nn.Conv2d):
	"""二维卷积层的SVD形式"""
	def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', decomposition_mode='channel'):		
		super(Conv2dSVD, self).__init__(
			in_channels=in_channels, 
			out_channels=out_channels, 
			kernel_size=kernel_size, 
			stride=stride, 
			padding=padding, 
			dilation=dilation, 
			groups=groups, 
			bias=bias, 
			padding_mode=padding_mode,
		)
		kernel_height, kernel_width = self.kernel_size
		self.decomposition_mode = decomposition_mode
		if self.decomposition_mode == 'channel':						 # channel-wise decomposition
			rank = min(out_channels, in_channels * kernel_height * kernel_width)														# r = min(n, cwh)
			self.svd_weight_matrix_u = torch.nn.Parameter(torch.Tensor(out_channels, rank))												# left singular-vector matrix, shape n×r
			self.svd_weight_matrix_v = torch.nn.Parameter(torch.Tensor(in_channels * kernel_width * kernel_height, rank))				# right singular-vector matrix, shape cwh×r
			self.svd_weight_vector_s = torch.nn.Parameter(torch.Tensor(rank, ))															# singular-value vector
		
		elif self.decomposition_mode == 'spatial':						 # spatial-wise decomposition
			rank = min(out_channels * kernel_width, in_channels * kernel_height)														# r = min(nw, ch)
			self.svd_weight_matrix_u = torch.nn.Parameter(torch.Tensor(out_channels * kernel_width, rank))								# left singular-vector matrix, shape nw×r
			self.svd_weight_matrix_v = torch.nn.Parameter(torch.Tensor(in_channels * kernel_height, rank))								# right singular-vector matrix, shape ch×r
			self.svd_weight_vector_s = torch.nn.Parameter(torch.Tensor(rank, ))															# singular-value vector
		else:
			raise Exception(f'Unknown decomposition mode: {decomposition_mode}')

	def forward(self, input):
		kernel_height, kernel_width = self.kernel_size
		if self.decomposition_mode == 'channel':						 # channel-wise decomposition
			weight = torch.mm(torch.mm(self.svd_weight_matrix_u, torch.diag(self.svd_weight_vector_s)), self.svd_weight_matrix_v.t())	# (out_channels, in_channels * kernel_width * kernel_height)
			weight = weight.reshape(self.out_channels, self.in_channels, kernel_height, kernel_width)									# (out_channels, in_channels, kernel_height, kernel_width)
		elif self.decomposition_mode == 'spatial':						 # spatial-wise decomposition
			weight = torch.mm(torch.mm(self.svd_weight_matrix_u, torch.diag(self.svd_weight_vector_s)), self.svd_weight_matrix_v.t())	# (out_channels * kernel_width, in_channels * kernel_height)
			weight = weight.reshape(self.out_channels, kernel_width, self.in_channels, kernel_height)									# (out_channels, kernel_width, in_channels, kernel_height)
			weight = weight.permute((0, 2, 3, 1))																						# reshaping directly to (out_channels, in_channels, kernel_height, kernel_width) might also work, but reshaping in order and then permuting the dimensions seems more consistent with the matrix shapes
		# Usage: torch.nn.functional.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)
		if not self.padding_mode == 'zeros':							 # this branch of the source code looks unfinished and a bit messy, but padding is usually zero padding, so it is rarely needed
			from torch._six import container_abcs
			from itertools import repeat
			def _reverse_repeat_tuple(t, n):
				return tuple(x for x in reversed(t) for _ in range(n))
			def _ntuple(n):
				def parse(x):
					if isinstance(x, container_abcs.Iterable):
						return x
					return tuple(repeat(x, n))
				return parse
			_pair = _ntuple(2)
			return F.conv2d(F.pad(input, _reverse_repeat_tuple(self.padding, 2), mode=self.padding_mode), weight, self.bias, self.stride, _pair(0), self.dilation, self.groups)
		return F.conv2d(input, weight, self.bias, self.stride, self.padding, self.dilation, self.groups)

OK, the layers are rewritten; next comes the loss function.


3 How to Rewrite the Loss Function for SVD Training?

From the example in the first section, one can see that the loss function is actually not hard to write: once the model is passed in as an argument, the problem basically solves itself.

Since CIFAR10 is a multi-class classification problem, we rewrite the cross-entropy loss. The approach is the same as for the layers above; the source code is in loss.py under E:\Anaconda3\Lib\site-packages\torch\nn\modules and is not repeated here. We simply subclass torch.nn.CrossEntropyLoss:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import torch
from torch.nn import functional as F

class CrossEntropyLossSVD(torch.nn.CrossEntropyLoss):

	def __init__(self, weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean'):
		super(CrossEntropyLossSVD, self).__init__(
			weight=weight,
			size_average=size_average,
			ignore_index=ignore_index,
			reduce=reduce,
			reduction=reduction,
		)

	def forward(self, input, target, model=None, regularizer_weights=[1, 1], orthogonal_suffix='svd_weight_matrix', sparse_suffix='svd_weight_vector', mode='lh') -> torch.FloatTensor:
		cross_entropy_loss = F.cross_entropy(input, target, weight=self.weight, ignore_index=self.ignore_index, reduction=self.reduction)
		if model is None:
			return cross_entropy_loss
		
		# Orthogonality regularizer
		def _orthogonality_regularizer(x):		 		 				 # x should be a 2D tensor (matrix), taller than it is wide
			return torch.norm(torch.mm(x.t(), x) - torch.eye(x.shape[1]).cuda(), p='fro') / x.shape[1] / x.shape[1]
		
		# Sparsity-inducing regularizer
		def _sparsity_inducing_regularizer(x, mode='lh'):				 # x should be a 1D tensor (vector)
			mode = mode.lower()
			if mode == 'lh':
				return torch.norm(x, 1) / torch.norm(x, 2)	
			elif mode == 'l1':
				return torch.norm(x, 1)
			raise Exception(f'Unknown mode: {mode}')
	
		regularizer = torch.zeros(1, ).cuda()
		for name, parameter in model.named_parameters():
			lastname = name.split('.')[-1]
			if lastname.startswith(orthogonal_suffix):					 # singular-vector matrix parameter: add the orthogonality regularizer
				regularizer += _orthogonality_regularizer(parameter) * regularizer_weights[0]
			elif lastname.startswith(sparse_suffix):					 # singular-value vector parameter: add the sparsity-inducing regularizer
				regularizer += _sparsity_inducing_regularizer(parameter, mode) * regularizer_weights[1]
		return cross_entropy_loss + regularizer

As you can see, when rewriting the fully connected and convolution layers I named every singular-value vector parameter with the prefix svd_weight_vector and every singular-vector matrix parameter with the prefix svd_weight_matrix, so the loss captures the parameters to regularize through exactly these name prefixes.
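
For example, a loop like the one below (a sketch, where model can be any network built from the LinearSVD / Conv2dSVD layers above) shows which parameters the two regularizers pick up and which are left alone:

for name, parameter in model.named_parameters():
    lastname = name.split('.')[-1]                      # e.g. 'layer1.0.left.0.svd_weight_matrix_u' -> 'svd_weight_matrix_u'
    if lastname.startswith('svd_weight_matrix'):
        print(f'orthogonality regularizer <- {name}')
    elif lastname.startswith('svd_weight_vector'):
        print(f'sparsity regularizer      <- {name}')
    else:
        print(f'no regularizer            <- {name}')   # e.g. BatchNorm weights and biases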

A point worth noting is regularizer = torch.zeros(1, ).cuda(): when the model lives on the GPU, all model parameters are on the cuda device by default; if we wrote regularizer = torch.zeros(1, ), the regularizer would live on the cpu device, and we would get an error about two tensors in an operation not being on the same device.
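
A slightly more robust alternative (just a sketch, assuming the model has at least one parameter) is to read the device off the model instead of hard-coding .cuda(), so the same loss works on both CPU and GPU:

device = next(model.parameters()).device           # device the model parameters live on
regularizer = torch.zeros(1, device=device)        # accumulate the regularizer on that same device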


4 How to Rewrite a Neural Network Model for SVD Training? (Taking ResNet18 as an Example)

In fact, E:\Anaconda3\Lib\site-packages\torchvision\models contains open-source implementations of many popular models, with new architectures added as torchvision is upgraded, from the old classics such as alexnet and resnet to googlenet and other newer ones. This code can basically be copied and used as-is without modification, so we can stand on the shoulders of giants and try to restructure the models from there.

As for obtaining the CIFAR10 dataset, PyTorch provides an interface that downloads CIFAR10 directly, but you can also fetch it yourself from http://www.cs.toronto.edu/~kriz/cifar.html:

import torchvision as tv            # contains many datasets
import torch
import torchvision.transforms as transforms    # image transformation utilities
from torchvision.transforms import ToPILImage

# Load and preprocess CIFAR10 with torchvision
show = ToPILImage()         # converts a Tensor into an Image, convenient for visualization
transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize(mean = (0.5,0.5,0.5),std = (0.5,0.5,0.5))])# convert the data to tensors and normalize the range [0, 255] -> [0.0,1.0]
trainset = tv.datasets.CIFAR10(root='data1/',train = True,download=True,transform=transform)
trainloader = torch.utils.data.DataLoader(trainset,batch_size=4,shuffle=True,num_workers=0)
testset = tv.datasets.CIFAR10('data1/',train=False,download=True,transform=transform)
testloader = torch.utils.data.DataLoader(testset,batch_size=4,shuffle=True,num_workers=0)
classes = ('plane','car','bird','cat','deer','dog','frog','horse','ship','truck')
(data,label) = trainset[100]
print(classes[label])# prints "ship"
show((data+1)/2).resize((100,100))
dataiter = iter(trainloader)
images, labels = dataiter.next()
print(' '.join('%11s'%classes[labels[j]] for j in range(4)))
show(tv.utils.make_grid((images+1)/2)).resize((400,100))# make_grid stitches several images into one

Note that once you have downloaded the data, you can set download=False afterwards; there is no need to download it every time.

I then found a script online that trains ResNet18 on CIFAR10 and runs out of the box (apologies, I forget who the original author was, but scripts like this seem plentiful...):

#coding=gbk

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision as tv
import torchvision.transforms as transforms
import argparse
import os

class ResidualBlock(nn.Module):
	def __init__(self, inchannel, outchannel, stride=1):
		super(ResidualBlock, self).__init__()
		self.left = nn.Sequential(
			nn.Conv2d(inchannel, outchannel, kernel_size=3, stride=stride, padding=1, bias=False),
			nn.BatchNorm2d(outchannel),
			nn.ReLU(inplace=True),
			nn.Conv2d(outchannel, outchannel, kernel_size=3, stride=1, padding=1, bias=False),
			nn.BatchNorm2d(outchannel)
		)
		self.shortcut = nn.Sequential()
		if stride != 1 or inchannel != outchannel:
			self.shortcut = nn.Sequential(
				nn.Conv2d(inchannel, outchannel, kernel_size=1, stride=stride, bias=False),
				nn.BatchNorm2d(outchannel)
			)

	def forward(self, x):
		out = self.left(x)
		out += self.shortcut(x)
		out = F.relu(out)
		return out

class ResNet(nn.Module):
	def __init__(self, ResidualBlock, num_classes=10):
		super(ResNet, self).__init__()
		self.inchannel = 64
		self.conv1 = nn.Sequential(
			nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
			nn.BatchNorm2d(64),
			nn.ReLU(),
		)
		self.layer1 = self.make_layer(ResidualBlock, 64,  2, stride=1)
		self.layer2 = self.make_layer(ResidualBlock, 128, 2, stride=2)
		self.layer3 = self.make_layer(ResidualBlock, 256, 2, stride=2)
		self.layer4 = self.make_layer(ResidualBlock, 512, 2, stride=2)
		self.fc = nn.Linear(512, num_classes)

	def make_layer(self, block, channels, num_blocks, stride):
		strides = [stride] + [1] * (num_blocks - 1)   #strides=[1,1]
		layers = []
		for stride in strides:
			layers.append(block(self.inchannel, channels, stride))
			self.inchannel = channels
		return nn.Sequential(*layers)

	def forward(self, x):
		out = self.conv1(x)
		out = self.layer1(out)
		out = self.layer2(out)
		out = self.layer3(out)
		out = self.layer4(out)
		out = F.avg_pool2d(out, 4)
		out = out.view(out.size(0), -1)
		out = self.fc(out)
		return out


def ResNet18():

	return ResNet(ResidualBlock)

# Whether to use the GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Argument parsing, so that parameters can be passed on the command line, Linux-style
parser = argparse.ArgumentParser(description='PyTorch CIFAR10 Training')
parser.add_argument('--outf', default='./model/', help='folder to output images and model checkpoints') # output folder for results
args = parser.parse_args()

# Hyperparameters
EPOCH = 135   # number of passes over the dataset
pre_epoch = 0  # number of epochs already completed
BATCH_SIZE = 128      # batch size
LR = 0.01        # learning rate

# Prepare and preprocess the dataset
transform_train = transforms.Compose([
	transforms.RandomCrop(32, padding=4),  # pad with zeros on all sides, then randomly crop to 32*32
	transforms.RandomHorizontalFlip(),  # flip the image horizontally with probability 0.5
	transforms.ToTensor(),
	transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)), # per-channel (R,G,B) means and standard deviations used for normalization
])

transform_test = transforms.Compose([
	transforms.ToTensor(),
	transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = tv.datasets.CIFAR10(root='data/', train=True, download=True, transform=transform_train)
testset = tv.datasets.CIFAR10('data/', train=False, download=True, transform=transform_test)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=0)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=True, num_workers=0)


# CIFAR10 labels
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Model definition - ResNet
net = ResNet18().to(device)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()  # cross-entropy loss, commonly used for multi-class classification
optimizer = optim.SGD(net.parameters(), lr=LR, momentum=0.9, weight_decay=5e-4) # mini-batch momentum SGD with L2 regularization (weight decay)

# Training
if __name__ == "__main__":
	if not os.path.exists(args.outf):
		os.makedirs(args.outf)
	best_acc = 85  # initial best test accuracy
	print("Start Training, Resnet-18!")
	with open("acc.txt", "w") as f:
		with open("log.txt", "w")as f2:
			for epoch in range(pre_epoch, EPOCH):
				print('\nEpoch: %d' % (epoch + 1))
				net.train()
				sum_loss = 0.0
				correct = 0.0
				total = 0.0
				for i, data in enumerate(trainloader, 0):
					# Prepare the data
					length = len(trainloader)
					inputs, labels = data
					inputs, labels = inputs.to(device), labels.to(device)
					optimizer.zero_grad()

					# forward + backward
					outputs = net(inputs)
					loss = criterion(outputs, labels)
					loss.backward()
					optimizer.step()

					# Print loss and accuracy after every batch
					sum_loss += loss.item()
					_, predicted = torch.max(outputs.data, 1)
					total += labels.size(0)
					correct += predicted.eq(labels.data).cpu().sum()
					print('[epoch:%d, iter:%d] Loss: %.03f | Acc: %.3f%% '
						  % (epoch + 1, (i + 1 + epoch * length), sum_loss / (i + 1), 100. * correct / total))
					f2.write('%03d  %05d |Loss: %.03f | Acc: %.3f%% '
						  % (epoch + 1, (i + 1 + epoch * length), sum_loss / (i + 1), 100. * correct / total))
					f2.write('\n')
					f2.flush()

				# Evaluate accuracy on the test set after every epoch
				print("Waiting Test!")
				with torch.no_grad():
					correct = 0
					total = 0
					for data in testloader:
						net.eval()
						images, labels = data
						images, labels = images.to(device), labels.to(device)
						outputs = net(images)
						# Take the class with the highest score (the index into outputs.data)
						_, predicted = torch.max(outputs.data, 1)
						total += labels.size(0)
						correct += (predicted == labels).sum()
					print('Test accuracy: %.3f%%' % (100 * correct / total))
					acc = 100. * correct / total
					# Write each test result to acc.txt in real time
					print('Saving model......')
					torch.save(net.state_dict(), '%s/net_%03d.pth' % (args.outf, epoch + 1))
					f.write("EPOCH=%03d,Accuracy= %.3f%%" % (epoch + 1, acc))
					f.write('\n')
					f.flush()
					# Record the best test accuracy and write it to best_acc.txt
					if acc > best_acc:
						f3 = open("best_acc.txt", "w")
						f3.write("EPOCH=%d,best_acc= %.3f%%" % (epoch + 1, acc))
						f3.close()
						best_acc = acc
			print("Training Finished, TotalEPOCH=%d" % EPOCH)



The training above reaches a very high accuracy. I rewrote it directly in SVD form using the classes written above (Conv2dSVD and LinearSVD live in cifar_layers.py, CrossEntropyLossSVD lives in cifar_loss.py); in effect this just means replacing every nn.Conv2d with Conv2dSVD, every nn.Linear with LinearSVD, and nn.CrossEntropyLoss with CrossEntropyLossSVD:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision as tv
import torchvision.transforms as transforms
import argparse
import os

from cifar_layers import Conv2dSVD, LinearSVD
from cifar_loss import CrossEntropyLossSVD

class ResidualBlock(nn.Module):
	def __init__(self, inchannel, outchannel, stride=1):
		super(ResidualBlock, self).__init__()
		self.left = nn.Sequential(
			Conv2dSVD(inchannel, outchannel, kernel_size=3, stride=stride, padding=1, bias=False),
			nn.BatchNorm2d(outchannel),
			nn.ReLU(inplace=True),
			Conv2dSVD(outchannel, outchannel, kernel_size=3, stride=1, padding=1, bias=False),
			nn.BatchNorm2d(outchannel)
		)
		self.shortcut = nn.Sequential()
		if stride != 1 or inchannel != outchannel:
			self.shortcut = nn.Sequential(
				Conv2dSVD(inchannel, outchannel, kernel_size=1, stride=stride, bias=False),
				nn.BatchNorm2d(outchannel)
			)

	def forward(self, x):
		out = self.left(x)
		out += self.shortcut(x)
		out = F.relu(out)
		return out

class ResNet(nn.Module):
	def __init__(self, ResidualBlock, num_classes=10):
		super(ResNet, self).__init__()
		self.inchannel = 64
		self.conv1 = nn.Sequential(
			Conv2dSVD(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
			nn.BatchNorm2d(64),
			nn.ReLU(),
		)
		self.layer1 = self.make_layer(ResidualBlock, 64,  2, stride=1)
		self.layer2 = self.make_layer(ResidualBlock, 128, 2, stride=2)
		self.layer3 = self.make_layer(ResidualBlock, 256, 2, stride=2)
		self.layer4 = self.make_layer(ResidualBlock, 512, 2, stride=2)
		self.fc = LinearSVD(512, num_classes)

	def make_layer(self, block, channels, num_blocks, stride):
		strides = [stride] + [1] * (num_blocks - 1)
		layers = []
		for stride in strides:
			layers.append(block(self.inchannel, channels, stride))
			self.inchannel = channels
		return nn.Sequential(*layers)

	def forward(self, x):
		out = self.conv1(x)
		out = self.layer1(out)
		out = self.layer2(out)
		out = self.layer3(out)
		out = self.layer4(out)
		out = F.avg_pool2d(out, 4)
		out = out.view(out.size(0), -1)
		out = self.fc(out)
		return out


def ResNet18():

	return ResNet(ResidualBlock)



# Whether to use the GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Argument parsing, so that parameters can be passed on the command line, Linux-style
parser = argparse.ArgumentParser(description='PyTorch CIFAR10 SVD Training')
parser.add_argument('--outf', default='./svdmodel/', help='folder to output images and model checkpoints') # output folder for results
args = parser.parse_args()

# Hyperparameters
EPOCH = 135   # number of passes over the dataset
pre_epoch = 0  # number of epochs already completed
BATCH_SIZE = 128      # batch size
LR = 0.01        # learning rate

# Model definition - ResNet
net = ResNet18().to(device)

# CIFAR10 labels
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


# Loss function and optimizer
criterion = CrossEntropyLossSVD()  # cross-entropy loss, commonly used for multi-class classification
optimizer = optim.SGD(net.parameters(), lr=LR, momentum=0.9, weight_decay=5e-4) # mini-batch momentum SGD with L2 regularization (weight decay)




# Prepare and preprocess the dataset
transform_train = transforms.Compose([
	transforms.RandomCrop(32, padding=4),  # pad with zeros on all sides, then randomly crop to 32*32
	transforms.RandomHorizontalFlip(),  # flip the image horizontally with probability 0.5
	transforms.ToTensor(),
	transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)), # per-channel (R,G,B) means and standard deviations used for normalization
])

transform_test = transforms.Compose([
	transforms.ToTensor(),
	transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = tv.datasets.CIFAR10(root='data/', train=True, download=False, transform=transform_train)
testset = tv.datasets.CIFAR10('data/', train=False, download=False, transform=transform_test)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=0)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=True, num_workers=0)


# Training
if __name__ == "__main__":
	if not os.path.exists(args.outf):
		os.makedirs(args.outf)
	best_acc = 85  # initial best test accuracy
	print("Start SVD Training, Resnet-18!")
	with open("svdacc.txt", "w") as f:
		with open("svdlog.txt", "w")as f2:
			for epoch in range(pre_epoch, EPOCH):
				print('\nEpoch: %d' % (epoch + 1))
				net.train()
				sum_loss = 0.0
				correct = 0.0
				total = 0.0
				for i, data in enumerate(trainloader, 0):
					# Prepare the data
					length = len(trainloader)
					inputs, labels = data
					inputs, labels = inputs.to(device), labels.to(device)
					optimizer.zero_grad()
					
					# forward + backward
					outputs = net(inputs)

					loss = criterion(outputs, labels, net)
					loss.backward()
					optimizer.step()

					# Print loss and accuracy after every batch
					sum_loss += loss.item()
					_, predicted = torch.max(outputs.data, 1)
					total += labels.size(0)
					correct += predicted.eq(labels.data).cpu().sum()
					print('[epoch:%d, iter:%d] Loss: %.03f | Acc: %.3f%% '
						  % (epoch + 1, (i + 1 + epoch * length), sum_loss / (i + 1), 100. * correct / total))
					f2.write('%03d  %05d |Loss: %.03f | Acc: %.3f%% '
						  % (epoch + 1, (i + 1 + epoch * length), sum_loss / (i + 1), 100. * correct / total))
					f2.write('\n')
					f2.flush()

				# Evaluate accuracy on the test set after every epoch
				print("Waiting Test!")
				with torch.no_grad():
					correct = 0
					total = 0
					for data in testloader:
						net.eval()
						images, labels = data
						images, labels = images.to(device), labels.to(device)
						outputs = net(images)
						# Take the class with the highest score (the index into outputs.data)
						_, predicted = torch.max(outputs.data, 1)
						total += labels.size(0)
						correct += (predicted == labels).sum()
					print('Test accuracy: %.3f%%' % (100 * correct / total))
					acc = 100. * correct / total
					# Write each test result to svdacc.txt in real time
					print('Saving model......')
					torch.save(net.state_dict(), '%s/net_%03d.pth' % (args.outf, epoch + 1))
					f.write("EPOCH=%03d,Accuracy= %.3f%%" % (epoch + 1, acc))
					f.write('\n')
					f.flush()
					# Record the best test accuracy and write it to svd_best_acc.txt
					if acc > best_acc:
						f3 = open("svd_best_acc.txt", "w")
						f3.write("EPOCH=%d,best_acc= %.3f%%" % (epoch + 1, acc))
						f3.close()
						best_acc = acc
			print("Training Finished, TotalEPOCH=%d" % EPOCH)

Not bad: I tried it and it runs, but there should still be some issues to resolve, so I will stop here for today.


Update on June 22, 2021

There was a small problem with the code in 【论文实现】以SVD的分解形式进行深度神经网络的训练(PyTorch): I forgot to initialize the weight parameters of the rewritten convolution and linear layers, so the network output contained a large number of NaN values and the loss computation went wrong.

In truth I had not read the source code carefully enough: torch applies kaiming_uniform_ initialization to all weight matrix parameters, whereas parameters created with torch.Tensor alone are left uninitialized, which causes problems. I cannot be bothered to fix the original post (nobody is really going to scrutinize it anyway). The corrected cifar_layers.py is below (initialization is added in the constructors; the cifar_init module is simply the source of E:\Anaconda3\Lib\site-packages\torch\nn\init.py, which can be copied over and used directly):

There is now a new problem: training seems to be quite slow, whereas SVD training should in principle be fast, so I am probably misunderstanding something.

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu.cn

import math
import torch
import cifar_init as init
from torch.nn import functional as F

class Conv2dSVD(torch.nn.Conv2d):
	"""二维卷积层的SVD形式"""
	def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', decomposition_mode='channel'):		
		super(Conv2dSVD, self).__init__(
			in_channels=in_channels, 
			out_channels=out_channels, 
			kernel_size=kernel_size, 
			stride=stride, 
			padding=padding, 
			dilation=dilation, 
			groups=groups, 
			bias=bias, 
			padding_mode=padding_mode,
		)
		kernel_height, kernel_width = self.kernel_size
		self.decomposition_mode = decomposition_mode
		if self.decomposition_mode == 'channel':						 # channel-wise decomposition
			rank = min(out_channels, in_channels * kernel_height * kernel_width)														# r = min(n, cwh)
			self.svd_weight_matrix_u = torch.nn.Parameter(torch.Tensor(out_channels, rank))												# left singular-vector matrix, shape n×r
			self.svd_weight_matrix_v = torch.nn.Parameter(torch.Tensor(in_channels * kernel_width * kernel_height, rank))				# right singular-vector matrix, shape cwh×r
			self.svd_weight_vector_s = torch.nn.Parameter(torch.Tensor(rank, ))															# singular-value vector
		
		elif self.decomposition_mode == 'spatial':						 # spatial-wise decomposition
			rank = min(out_channels * kernel_width, in_channels * kernel_height)														# r = min(nw, ch)
			self.svd_weight_matrix_u = torch.nn.Parameter(torch.Tensor(out_channels * kernel_width, rank))								# left singular-vector matrix, shape nw×r
			self.svd_weight_matrix_v = torch.nn.Parameter(torch.Tensor(in_channels * kernel_height, rank))								# right singular-vector matrix, shape ch×r
			self.svd_weight_vector_s = torch.nn.Parameter(torch.Tensor(rank, ))															# singular-value vector
		else:
			raise Exception(f'Unknown decomposition mode: {decomposition_mode}')
			
		# Parameter initialization: note that the parent constructor calls self.reset_parameters
		init.kaiming_uniform_(self.svd_weight_matrix_u, a=math.sqrt(5))
		init.kaiming_uniform_(self.svd_weight_matrix_v, a=math.sqrt(5))
		init.uniform_(self.svd_weight_vector_s, -1, 1)

	def forward(self, input):
		kernel_height, kernel_width = self.kernel_size
		if self.decomposition_mode == 'channel':						 # channel-wise decomposition
			weight = torch.mm(torch.mm(self.svd_weight_matrix_u, torch.diag(self.svd_weight_vector_s)), self.svd_weight_matrix_v.t())	# (out_channels, in_channels * kernel_width * kernel_height)
			weight = weight.reshape(self.out_channels, self.in_channels, kernel_height, kernel_width)									# (out_channels, in_channels, kernel_height, kernel_width)
		elif self.decomposition_mode == 'spatial':						 # spatial-wise decomposition
			weight = torch.mm(torch.mm(self.svd_weight_matrix_u, torch.diag(self.svd_weight_vector_s)), self.svd_weight_matrix_v.t())	# (out_channels * kernel_width, in_channels * kernel_height)
			weight = weight.reshape(self.out_channels, kernel_width, self.in_channels, kernel_height)									# (out_channels, kernel_width, in_channels, kernel_height)
			weight = weight.permute((0, 2, 3, 1))																						# reshaping directly to (out_channels, in_channels, kernel_height, kernel_width) might also work, but reshaping in order and then permuting the dimensions seems more consistent with the matrix shapes
		# Usage: torch.nn.functional.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)
		if not self.padding_mode == 'zeros':							 # this branch of the source code looks unfinished and a bit messy, but padding is usually zero padding, so it is rarely needed
			from torch._six import container_abcs
			from itertools import repeat
			def _reverse_repeat_tuple(t, n):
				return tuple(x for x in reversed(t) for _ in range(n))
			def _ntuple(n):
				def parse(x):
					if isinstance(x, container_abcs.Iterable):
						return x
					return tuple(repeat(x, n))
				return parse
			_pair = _ntuple(2)
			return F.conv2d(F.pad(input, _reverse_repeat_tuple(self.padding, 2), mode=self.padding_mode), weight, self.bias, self.stride, _pair(0), self.dilation, self.groups)
		return F.conv2d(input, weight, self.bias, self.stride, self.padding, self.dilation, self.groups)

class LinearSVD(torch.nn.Linear):
	"""线性层的SVD形式"""
	def __init__(self, in_features, out_features, bias=True):
		super(LinearSVD, self).__init__(in_features, out_features, bias)
		rank = min(in_features, out_features)
		self.svd_weight_matrix_u = torch.nn.Parameter(torch.Tensor(out_features, rank))
		self.svd_weight_matrix_v = torch.nn.Parameter(torch.Tensor(in_features, rank))
		self.svd_weight_vector_s = torch.nn.Parameter(torch.Tensor(rank, ))
		
		# Parameter initialization: note that the parent constructor calls self.reset_parameters
		init.kaiming_uniform_(self.svd_weight_matrix_u, a=math.sqrt(5))
		init.kaiming_uniform_(self.svd_weight_matrix_v, a=math.sqrt(5))
		init.uniform_(self.svd_weight_vector_s, -1, 1)
		
	def forward(self, input):
		weight = torch.mm(torch.mm(self.svd_weight_matrix_u, torch.diag(self.svd_weight_vector_s)), self.svd_weight_matrix_v.t())
		return F.linear(input, weight, self.bias)

I ran some tests recently and found that, without the regularization terms, the model reaches roughly the same level as the original (training accuracy close to 80%, test accuracy a bit above 80%); but when I set the regularization coefficients to 1 as in the paper, the results are poor, with accuracy stuck around 20%. I did not pay much attention to the training speed; I have mostly been busy with exams lately and only squeezed in some quick tests without a careful analysis. With all the exams finished today, I came to the library meaning to turn over a new leaf, only to run into Wang Yangyang again; well, take things as they come, I really do bring this on myself.


PostScript

It is 1:48 in the morning on May 30, 2021; it has been a while since I stayed up this late. I recently wrote a script to check when my log blog caoyang.log gets visited, and found that for several days in a row someone has been coming by at one or two in the morning. I honestly do not know who it is, and I dare not guess.

My mental state has not been great lately, but it is not terrible either; at least my running is in decent shape, as I broke my personal bests over 5 km and 6 km one after the other. My body is still holding up, which is why I could stay up, despite the pressure of my advisor's tasks, to write this post and try to pull my life back into its old rhythm.

It really is as S said: no road you have walked is walked in vain; past experience makes a person calmer, more at ease, and more appreciative of the present. I do not think this is an empty platitude, because once you have lived through it, your understanding of things really is different.

May is drawing to a close, and I admit my efficiency this month has been poor; through the first two thirds of the month I had not fully come out of it, but I have been doing my best to recover my pace and hope to return to my previous state. Looking back though, S, you were meddling; you should not have bothered with me, and should have let me keep walking alone, waiting for a her who may never appear. Now I have one more futile thing to dwell on, and my mind still gets restless from time to time; that is about the size of it.
