1 Why can logistic regression be initialized to 0?

1.1 Notation

Input: $x_1, x_2$
Output: $a$
Weights: $w_1, w_2$
Bias: $b$
Activation function: sigmoid
Loss function: cross entropy

Logistic regression in formula form: $a = \mathrm{sigmoid}(w_1 x_1 + w_2 x_2 + b)$
Loss function: $L = -y \log(a) - (1-y)\log(1-a)$
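The two formulas above translate directly into code. A minimal Python sketch (the function names are my own, not from the original):

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w1, w2, b):
    """a = sigmoid(w1*x1 + w2*x2 + b)"""
    return sigmoid(w1 * x1 + w2 * x2 + b)

def cross_entropy(a, y):
    """L = -y*log(a) - (1-y)*log(1-a)"""
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

# With every parameter at 0 the output is sigmoid(0) = 0.5,
# regardless of the inputs.
a = forward(1.0, 2.0, 0.0, 0.0, 0.0)
print(a)  # 0.5
```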

1.2 Backpropagation

Derivative of the sigmoid function: $s' = s(1-s)$

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}$$

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial w_1} = (a-y)x_1$$

$$\frac{\partial L}{\partial w_2} = (a-y)x_2$$

$$\frac{\partial L}{\partial b} = a-y$$

1.3 Parameter updates

$$w_1 := w_1 - \alpha \frac{\partial L}{\partial w_1}$$

$$w_2 := w_2 - \alpha \frac{\partial L}{\partial w_2}$$

$$b := b - \alpha \frac{\partial L}{\partial b}$$
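Combining the gradients and update rules, a single gradient-descent step can be sketched as follows (a minimal illustration; `sgd_step` and the sample values are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(x1, x2, y, w1, w2, b, lr=0.1):
    """One gradient-descent step using the closed-form gradients:
    dL/dw1 = (a-y)*x1, dL/dw2 = (a-y)*x2, dL/db = (a-y)."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    err = a - y
    return (w1 - lr * err * x1,
            w2 - lr * err * x2,
            b - lr * err)

# Starting from all zeros: a = 0.5, err = -0.5, so every parameter
# moves away from 0 after a single step.
w1, w2, b = 0.0, 0.0, 0.0
w1, w2, b = sgd_step(1.0, 2.0, 1, w1, w2, b)
print(w1, w2, b)
```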

1.4 Analysis

When $w_1, w_2$ are initialized to 0, the gradients $\frac{\partial L}{\partial w_1} = (a-y)x_1$ and $\frac{\partial L}{\partial w_2} = (a-y)x_2$ do not depend on the weights themselves, so they are generally nonzero and the weights update normally. When $b$ is initialized to 0, the output is $a = 0.5$, so $\frac{\partial L}{\partial b} = a - y$ equals $0.5$ or $-0.5$ depending on the label; the bias also updates normally. Logistic regression can therefore safely be initialized to all zeros.
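This can be checked numerically. A self-contained sketch (the toy dataset is made up for illustration, with label 1 on the positive inputs):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset (made up): label 1 when the inputs are positive.
data = [((1.0, 2.0), 1), ((2.0, 1.0), 1), ((-1.0, -2.0), 0), ((-2.0, -1.0), 0)]

# All parameters start at exactly 0, which is harmless for logistic regression.
w1 = w2 = b = 0.0
lr = 0.5
for _ in range(200):
    for (x1, x2), y in data:
        a = sigmoid(w1 * x1 + w2 * x2 + b)
        err = a - y
        w1 -= lr * err * x1
        w2 -= lr * err * x2
        b -= lr * err

# Each weight received its own gradient (a - y) * x_i, so the model
# trains normally from the all-zero start:
print(sigmoid(w1 * 1.0 + w2 * 2.0 + b) > 0.5)    # True  (class 1)
print(sigmoid(-w1 * 1.0 - w2 * 2.0 + b) < 0.5)   # True  (class 0)
```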

2 Why can't a neural network's weights be initialized to 0?

2.1 Notation

Consider a network with two inputs, one hidden layer of two neurons, and one output neuron.

Input: $x_1, x_2$
Activations: $a_1, a_2$ (hidden), $a_3$ (output)
Weights: $w_{11}, w_{12}, w_{21}, w_{22}$ (input to hidden), $w_{13}, w_{23}$ (hidden to output)
Biases: $b_1, b_2, b_3$
Activation function: sigmoid
Loss function: cross entropy

Loss function: $L = -y \log(a_3) - (1-y)\log(1-a_3)$

2.2 Forward propagation

$$a_1 = \mathrm{sigmoid}(w_{11}x_1 + w_{21}x_2 + b_1)$$

$$a_2 = \mathrm{sigmoid}(w_{12}x_1 + w_{22}x_2 + b_2)$$

$$a_3 = \mathrm{sigmoid}(w_{13}a_1 + w_{23}a_2 + b_3)$$
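The forward pass above can be sketched as follows (keeping the parameters in a dict keyed by the symbols in the text is my own convention):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, p):
    """Forward pass of the 2-2-1 network defined above."""
    a1 = sigmoid(p["w11"] * x1 + p["w21"] * x2 + p["b1"])
    a2 = sigmoid(p["w12"] * x1 + p["w22"] * x2 + p["b2"])
    a3 = sigmoid(p["w13"] * a1 + p["w23"] * a2 + p["b3"])
    return a1, a2, a3

# With everything initialized to 0, every activation is sigmoid(0) = 0.5,
# no matter what the inputs are.
zeros = {k: 0.0 for k in ["w11", "w21", "w12", "w22", "w13", "w23", "b1", "b2", "b3"]}
print(forward(1.0, 2.0, zeros))  # (0.5, 0.5, 0.5)
```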

2.3 Backpropagation

$$\frac{\partial L}{\partial a_3} = -\frac{y}{a_3} + \frac{1-y}{1-a_3}$$

$$\frac{\partial L}{\partial w_{13}} = (a_3 - y)a_1$$

$$\frac{\partial L}{\partial w_{23}} = (a_3 - y)a_2$$

$$\frac{\partial L}{\partial b_3} = a_3 - y$$

$$\frac{\partial L}{\partial a_1} = (a_3 - y)w_{13}$$

$$\frac{\partial L}{\partial a_2} = (a_3 - y)w_{23}$$

$$\frac{\partial L}{\partial w_{11}} = (a_3 - y)w_{13}\, a_1(1-a_1)\, x_1$$

$$\frac{\partial L}{\partial w_{21}} = (a_3 - y)w_{13}\, a_1(1-a_1)\, x_2$$

$$\frac{\partial L}{\partial b_1} = (a_3 - y)w_{13}\, a_1(1-a_1)$$

$$\frac{\partial L}{\partial w_{12}} = (a_3 - y)w_{23}\, a_2(1-a_2)\, x_1$$

$$\frac{\partial L}{\partial w_{22}} = (a_3 - y)w_{23}\, a_2(1-a_2)\, x_2$$

$$\frac{\partial L}{\partial b_2} = (a_3 - y)w_{23}\, a_2(1-a_2)$$
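These gradient formulas translate directly to code. A minimal sketch (the function name and parameter dict are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backward(x1, x2, y, p):
    """Gradients of the cross-entropy loss w.r.t. every parameter,
    following the formulas above."""
    a1 = sigmoid(p["w11"] * x1 + p["w21"] * x2 + p["b1"])
    a2 = sigmoid(p["w12"] * x1 + p["w22"] * x2 + p["b2"])
    a3 = sigmoid(p["w13"] * a1 + p["w23"] * a2 + p["b3"])
    d3 = a3 - y                              # dL/dz3
    d1 = d3 * p["w13"] * a1 * (1 - a1)       # dL/dz1, carries a factor w13
    d2 = d3 * p["w23"] * a2 * (1 - a2)       # dL/dz2, carries a factor w23
    return {"w13": d3 * a1, "w23": d3 * a2, "b3": d3,
            "w11": d1 * x1, "w21": d1 * x2, "b1": d1,
            "w12": d2 * x1, "w22": d2 * x2, "b2": d2}

# With all parameters at 0 and y = 1: the output-layer gradients are nonzero,
# but every hidden-layer gradient vanishes because it contains w13 or w23.
zeros = {k: 0.0 for k in ["w11", "w21", "w12", "w22", "w13", "w23", "b1", "b2", "b3"]}
g = backward(1.0, 2.0, 1, zeros)
print(g["w13"], g["w11"] == 0.0)  # -0.25 True
```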

2.4 Discussion

There are three cases to consider:

  • w initialized to 0, b initialized to 0
  • w initialized to 0, b initialized randomly
  • w initialized randomly, b initialized to 0

2.4.1 w initialized to 0, b initialized to 0

First batch: the forward pass gives $a_1 = a_2 = a_3 = 0.5$. In the backward pass, $w_{13}$ and $w_{23}$ do get updated (their gradients $(a_3-y)a_1$ and $(a_3-y)a_2$ are nonzero), but because $a_1 = a_2$ they receive identical updates; $b_3$ is also updated. The hidden-layer parameters $w_{11}, w_{21}, w_{12}, w_{22}, b_1, b_2$ have gradients proportional to $w_{13}$ or $w_{23}$, which are still 0 at this point, so these parameters are not updated and remain 0.

Second batch: $w_{13}$ and $w_{23}$ are equal, so in the backward pass $w_{11}$ and $w_{12}$ receive identical updates, as do $w_{21}$ and $w_{22}$. Likewise $a_1 = a_2$ still holds, so $w_{13}$ and $w_{23}$ again receive the same update.

After the $n$-th batch: every hidden-layer weight does get updated, but the weights of all neurons within a hidden layer remain identical to one another. In other words, all neurons in the same hidden layer produce the same output, so the network behaves as if each hidden layer had a single neuron; this symmetry is never broken.
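The symmetry argument can be verified numerically. A self-contained sketch that trains the 2-2-1 network repeatedly on one made-up example (the function and values are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x1, x2, y, p, lr=0.5):
    """One gradient step on the 2-2-1 network, using the gradients derived above."""
    a1 = sigmoid(p["w11"] * x1 + p["w21"] * x2 + p["b1"])
    a2 = sigmoid(p["w12"] * x1 + p["w22"] * x2 + p["b2"])
    a3 = sigmoid(p["w13"] * a1 + p["w23"] * a2 + p["b3"])
    d3 = a3 - y
    d1 = d3 * p["w13"] * a1 * (1 - a1)
    d2 = d3 * p["w23"] * a2 * (1 - a2)
    grads = {"w13": d3 * a1, "w23": d3 * a2, "b3": d3,
             "w11": d1 * x1, "w21": d1 * x2, "b1": d1,
             "w12": d2 * x1, "w22": d2 * x2, "b2": d2}
    for k in p:
        p[k] -= lr * grads[k]

p = {k: 0.0 for k in ["w11", "w21", "w12", "w22", "w13", "w23", "b1", "b2", "b3"]}
for _ in range(100):  # many batches
    train_step(1.0, 2.0, 1, p)

# The two hidden neurons remain exact mirror images of each other:
print(p["w11"] == p["w12"], p["w21"] == p["w22"], p["w13"] == p["w23"])  # True True True
```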

2.4.2 w initialized to 0, b initialized randomly

First batch: the forward pass gives $a_1 = \mathrm{sigmoid}(b_1)$, $a_2 = \mathrm{sigmoid}(b_2)$, $a_3 = \mathrm{sigmoid}(b_3)$. In the backward pass, $w_{13}, w_{23}, b_3$ are updated, but the hidden-layer parameters $w_{11}, w_{21}, w_{12}, w_{22}, b_1, b_2$ have gradients proportional to $w_{13}$ or $w_{23}$, which are still 0, so they are not updated and remain 0.

Second batch: since $w_{13}$ and $w_{23}$ are now nonzero, all parameters can be updated; and because $b_1 \neq b_2$, the two hidden neurons receive different gradients, so the symmetry problem does not arise.
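A sketch of the two batches described above, with fixed distinct biases standing in for the random initialization (the values are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grads(x1, x2, y, p):
    """Gradients of the 2-2-1 network from the formulas above."""
    a1 = sigmoid(p["w11"] * x1 + p["w21"] * x2 + p["b1"])
    a2 = sigmoid(p["w12"] * x1 + p["w22"] * x2 + p["b2"])
    a3 = sigmoid(p["w13"] * a1 + p["w23"] * a2 + p["b3"])
    d3 = a3 - y
    d1 = d3 * p["w13"] * a1 * (1 - a1)
    d2 = d3 * p["w23"] * a2 * (1 - a2)
    return {"w13": d3 * a1, "w23": d3 * a2, "b3": d3,
            "w11": d1 * x1, "w21": d1 * x2, "b1": d1,
            "w12": d2 * x1, "w22": d2 * x2, "b2": d2}

# Zero weights, distinct ("random") biases.
p = dict(w11=0.0, w21=0.0, w12=0.0, w22=0.0, w13=0.0, w23=0.0,
         b1=0.3, b2=-0.2, b3=0.1)

# First batch: only the output layer gets nonzero gradients, because the
# hidden-layer gradients carry a factor of w13 = w23 = 0.
g1 = grads(1.0, 2.0, 1, p)
print(g1["w11"] == 0.0, g1["w13"] != 0.0)  # True True

for k in p:  # apply the first update
    p[k] -= 0.5 * g1[k]

# Second batch: w13, w23 are now nonzero, so every gradient is nonzero,
# and since b1 != b2 the two hidden neurons get different updates.
g2 = grads(1.0, 2.0, 1, p)
print(g2["w11"] != 0.0, g2["w11"] != g2["w12"])  # True True
```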

2.4.3 w initialized randomly, b initialized to 0

In the forward pass, $a_1$ and $a_2$ are nonzero (and, with random weights, generally different), and in the backward pass every parameter can be updated from the very first batch.
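A quick numerical check of this case, with fixed distinct weights standing in for the random initialization (the values are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Distinct ("random") weights, zero biases.
p = dict(w11=0.1, w21=-0.2, w12=0.3, w22=0.05, w13=0.4, w23=-0.1,
         b1=0.0, b2=0.0, b3=0.0)

x1, x2, y = 1.0, 2.0, 1
a1 = sigmoid(p["w11"] * x1 + p["w21"] * x2 + p["b1"])
a2 = sigmoid(p["w12"] * x1 + p["w22"] * x2 + p["b2"])
a3 = sigmoid(p["w13"] * a1 + p["w23"] * a2 + p["b3"])
d3 = a3 - y
d1 = d3 * p["w13"] * a1 * (1 - a1)  # hidden-layer error signals
d2 = d3 * p["w23"] * a2 * (1 - a2)

# Every parameter's gradient is nonzero from the first batch, and the two
# hidden neurons receive different gradients: no symmetry problem.
print(d1 != 0.0, d2 != 0.0, d1 != d2)  # True True True
```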
