Using L2 Regularization in TensorFlow to Reduce Overfitting

2025-02-11 15:25:23

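How L2 regularization works:

How overfitting arises: as the loss falls during fitting (the sloped line), different batches of training samples make the red curve fluctuate widely. The low points in the figure are where overfitting occurs: the red curve dips below the true black curve, which means worse generalization.

So to reduce overfitting, that is, to damp this fluctuation, it is enough to reduce the magnitude of w.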

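How training with L2 regularization works: add the sum of squares of the parameters w, multiplied by a coefficient λ, to the loss. Training then suppresses the values of w. When the (absolute) values of w are small, the model is less complex, the curve is smoother, and overfitting is reduced (Occam's razor). See the formula in the figure below:

(Regularization does not stop you from fitting the curve, and not every parameter is suppressed blindly. It is a dynamic process, a tug-of-war between the data loss (cross_entropy) and the L2 loss. Training pulls w toward values that fit the data, the regularizer pushes back against growth in w, and the two partly cancel. Irrelevant wi keep shrinking but stay slightly above zero (that small remainder is better than nothing, and it is the trade-off L2 makes), while useful wi are kept in a moderate range, so the model generalizes better while still fitting. The detailed theory and derivations are omitted here.)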
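
For reference, the L2-regularized objective described above can be written out in its standard form. That this matches the article's figure is an assumption; the symbols L_data (the unregularized loss, here cross_entropy) and L_total are names chosen for this sketch.

```latex
L_{\text{total}}(w) = L_{\text{data}}(w) + \lambda \sum_i w_i^2
```

With tf.nn.l2_loss, which the code in this article uses, the penalty is computed as \lambda \cdot \tfrac{1}{2} \sum_i w_i^2; the factor 1/2 only rescales λ.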

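Why can't L1 do the same? Mainly because L1 has a side effect that makes it a poor fit for this scenario.

L1 replaces the square of wi in the L2 formula with its absolute value. Mathematically, this shrinks the wi unevenly: some stay large while others are driven to zero, giving a sparse solution, which amounts to feature selection. The reason L1 decay is less even than L2 decay is intuitive. Lowering w1 from 0.1 to 0 and lowering w2 from 1.0 to 0.9 both reduce the L1 penalty by 0.1, so to the optimizer and the loss the two moves are equally good. With squares, the first move saves 0.01 - 0 = 0.01 and the second saves 1 - 0.81 = 0.19, so shrinking w2 is clearly the better deal. The figure below makes the point best: the axes are w1 and w2, and the contour lines are values of the loss. In the left panel the intersection is at w1 = 0, w2 = max(w2), a typical sparse solution that discards w1; in the right panel the solution balances w1 and w2. This means that where a curve could have been fitted, w1 is lost and only a straight line remains: overfitting goes down, but so does fitting capacity (expressive power).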
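
The arithmetic in the paragraph above can be checked with a few lines of plain Python. This is only a sketch: the helper names l1_penalty and l2_penalty are made up for illustration, the λ factor is left out, and the two moves (w1 from 0.1 to 0, w2 from 1.0 to 0.9) are the ones from the text.

```python
def l1_penalty(w):
    """Sum of absolute values (the L1 penalty without the lambda factor)."""
    return sum(abs(x) for x in w)


def l2_penalty(w):
    """Sum of squares (the L2 penalty without the lambda factor)."""
    return sum(x * x for x in w)


start = [0.1, 1.0]      # (w1, w2) before the update
shrink_w1 = [0.0, 1.0]  # option A: w1 goes from 0.1 to 0
shrink_w2 = [0.1, 0.9]  # option B: w2 goes from 1.0 to 0.9

for name, penalty in (("L1", l1_penalty), ("L2", l2_penalty)):
    saved_a = penalty(start) - penalty(shrink_w1)
    saved_b = penalty(start) - penalty(shrink_w2)
    print(f"{name}: shrinking w1 saves {saved_a:.2f}, shrinking w2 saves {saved_b:.2f}")
```

Under L1 the two options save the same amount, so nothing stops a small weight from being pushed all the way to zero; under L2 the larger weight is the cheaper one to shrink.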

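L1 and L2 also go by the names Lasso and ridge, which are easy to mix up: one might think ridge regression is L1 because a ridge sounds "sharp". In fact the picture that matches ridge is this one; "mountain ridge" is the better reading, a line that descends gradually in a smooth curve.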

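Training

Train an MNIST classifier and compare training on cross_entropy alone with training on total_loss, which adds the L2 penalty.

MNIST is not a hard dataset, so there should not be many CONV layers before the FC layers: the results would be too good to show a difference. To show the effect of L2 regularization I keep only one CONV layer (note that the input to FC1 is h_pool1, bypassing conv2); the two-conv network can serve as a control group.

The first 1000 training samples are used as the validation set and the first 1000 test samples as the test set.

About the code: it is a basic CONV+FC network that predicts image labels and is evaluated and trained with cross_entropy.

Weights that need regularization are passed directly through l2_loss,

and both cross_entropy and the L2 loss are added to the collection 'losses'.

wd is the λ in the formula. The larger wd is, the stronger the penalty: overfitting decreases, but so does fitting capacity, so wd should be neither too large nor too small. Many people default to 0.004, which is usually fine since it comes from others' experience. In my experience, though, this value is not fixed, especially when you define your own loss function. If your weighted cross-entropy becomes ten times larger and wd stays the same, wd is effectively 0.0004 relative to before! Just as the gradient is summed when the loss uses reduce_sum, many things have to change together.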
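
Why the scale of the data loss matters can be seen on a toy problem. The sketch below assumes a one-parameter loss c*(w - 1)**2 + wd*w**2, which is not the MNIST model; it is used only because its minimizer has the closed form w = c / (c + wd). The helper best_w and the constants are hypothetical.

```python
def best_w(c, wd):
    """Minimizer of c*(w - 1)**2 + wd*w**2, from setting the derivative to zero."""
    return c / (c + wd)


baseline = best_w(c=1.0, wd=0.004)      # original loss scale, wd = 0.004
scaled_loss = best_w(c=10.0, wd=0.004)  # data loss 10x larger, wd unchanged
smaller_wd = best_w(c=1.0, wd=0.0004)   # original loss scale, wd divided by 10

print("baseline:            ", baseline)
print("10x loss, same wd:   ", scaled_loss)
print("same loss, wd / 10:  ", smaller_wd)
```

The last two cases give the same solution, so in this toy setting scaling the data loss by ten has the same effect as dividing wd by ten.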

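# var is the tf.Variable built from initial; l2_loss(initial) would give no gradient
var = tf.Variable(initial)
weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
tf.add_to_collection('losses', weight_decay)
tf.add_to_collection('losses', cross_entropy)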

total_loss = tf.add_n(tf.get_collection('losses')) sums all the collected losses. Training on total_loss implements the formula in the first figure.

The complete code:


from __future__ import print_function
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
# MNIST digits 0 to 9, labels one-hot encoded
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

def compute_accuracy(v_xs, v_ys):
  global prediction
  y_pre = sess.run(prediction, feed_dict={xs: v_xs, keep_prob: 1})
  correct_prediction = tf.equal(tf.argmax(y_pre,1), tf.argmax(v_ys,1))
  accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
  #result = sess.run(accuracy, feed_dict={xs: v_xs, ys: v_ys, keep_prob: 1})
  result = sess.run(accuracy, feed_dict={})
  return result

def weight_variable(shape, wd):
  initial = tf.truncated_normal(shape, stddev=0.1)
  var = tf.Variable(initial)

  if wd is not None:
    print('wd is not none!!!!!!!')
    # Penalize the trainable variable, not the random initializer tensor:
    # l2_loss(initial) has no gradient with respect to var.
    weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
    tf.add_to_collection('losses', weight_decay)

  return var

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

def conv2d(x, W):
  # stride [1, x_movement, y_movement, 1]
  # Must have strides[0] = strides[3] = 1
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  # stride [1, x_movement, y_movement, 1]
  return tf.nn.max_pool(x, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')

# define placeholder for inputs to network
# input_data already scales pixels to [0, 1], so no extra division by 255 is needed
xs = tf.placeholder(tf.float32, [None, 784])  # 28x28
ys = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)
x_image = tf.reshape(xs, [-1, 28, 28, 1])
# print(x_image.shape) # [n_samples, 28,28,1]

## conv1 layer ##
W_conv1 = weight_variable([5,5, 1,32], 0.) # patch 5x5, in size 1, out size 32
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) # output size 28x28x32
h_pool1 = max_pool_2x2(h_conv1)                     # output size 14x14x32

## conv2 layer ##
W_conv2 = weight_variable([5,5, 32, 64], 0.) # patch 5x5, in size 32, out size 64
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2) # output size 14x14x64
h_pool2 = max_pool_2x2(h_conv2)                     # output size 7x7x64

#############################################################################
## fc1 layer ##
W_fc1 = weight_variable([14*14*32, 1024], wd = 0.)#do not use conv2
#W_fc1 = weight_variable([7*7*64, 1024], wd = 0.00)#use conv2
b_fc1 = bias_variable([1024])
# [n_samples, 14, 14, 32] ->> [n_samples, 14*14*32]  (7*7*64 when conv2 is used)
h_pool2_flat = tf.reshape(h_pool1, [-1, 14*14*32])#do not use conv2
#h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])#use conv2
###############################################################################

h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

## fc2 layer ##
W_fc2 = weight_variable([1024, 10], wd = 0.)
b_fc2 = bias_variable([10])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

# the error between prediction and real data
cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction),
                       reduction_indices=[1]))    # loss

tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses'))
print(total_loss)

train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
train_op_with_l2_norm = tf.train.AdamOptimizer(1e-4).minimize(total_loss)

sess = tf.Session()
# important step
# tf.initialize_all_variables() is no longer valid from
# 2017-03-02 if using tensorflow >= 0.12
if int((tf.__version__).split('.')[1]) < 12 and int((tf.__version__).split('.')[0]) < 1:
  init = tf.initialize_all_variables()
else:
  init = tf.global_variables_initializer()
sess.run(init)

for i in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
  # sess.run(train_op_with_l2_norm, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
  # sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})#dropout
  if i % 100 == 0:
    print('train accuracy',compute_accuracy(
      mnist.train.images[:1000], mnist.train.labels[:1000]))
    print('test accuracy',compute_accuracy(
      mnist.test.images[:1000], mnist.test.labels[:1000]))

The training runs are shown below.

No dropout, no L2 regularization, 1000 training steps:

weight_variable([1024, 10], wd = 0.)

Train accuracy is ahead of test accuracy at almost every checkpoint (often by about 0.01): the model is overfitting!

train accuracy 0.094
test accuracy 0.089
train accuracy 0.892
test accuracy 0.874
train accuracy 0.91
test accuracy 0.893
train accuracy 0.925
test accuracy 0.925
train accuracy 0.945
test accuracy 0.935
train accuracy 0.954
test accuracy 0.944
train accuracy 0.961
test accuracy 0.951
train accuracy 0.965
test accuracy 0.955
train accuracy 0.964
test accuracy 0.959
train accuracy 0.962
test accuracy 0.956

No dropout, L2 regularization on the FC layer with the weight decay factor set to 0.004, 1000 training steps:

weight_variable([1024, 10], wd = 0.004)

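Overfitting is noticeably reduced, and the test set sometimes even beats the training set (given the small evaluation sets, this only shows the general trend).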

train accuracy 0.107
test accuracy 0.145
train accuracy 0.876
test accuracy 0.861
train accuracy 0.91
test accuracy 0.909
train accuracy 0.923
test accuracy 0.919
train accuracy 0.931
test accuracy 0.927
train accuracy 0.936
test accuracy 0.939
train accuracy 0.956
test accuracy 0.949
train accuracy 0.958
test accuracy 0.954
train accuracy 0.947
test accuracy 0.95
train accuracy 0.947
test accuracy 0.953

Control group: no L2 regularization, dropout only. Overfitting is reduced.

W_fc1 = weight_variable([14*14*32, 1024], wd = 0.)
W_fc2 = weight_variable([1024, 10], wd = 0.)
  sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})#dropout
train accuracy 0.132
test accuracy 0.104
train accuracy 0.869
test accuracy 0.859
train accuracy 0.898
test accuracy 0.889
train accuracy 0.917
test accuracy 0.906
train accuracy 0.923
test accuracy 0.917
train accuracy 0.928
test accuracy 0.925
train accuracy 0.938
test accuracy 0.94
train accuracy 0.94
test accuracy 0.942
train accuracy 0.947
test accuracy 0.941
train accuracy 0.944
test accuracy 0.947
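
To compare the three logs above with one number each, the average train-minus-test gap can be computed. This is a minimal sketch: the accuracy values are copied from the logs, the first checkpoint (an untrained network) is skipped, and the helper mean_gap and the run labels are names invented here. With only 1000 evaluation samples per set, differences of this size are indicative at best.

```python
# Accuracy values copied from the three logs above, one entry per checkpoint.
runs = {
    "baseline": (
        [0.094, 0.892, 0.91, 0.925, 0.945, 0.954, 0.961, 0.965, 0.964, 0.962],
        [0.089, 0.874, 0.893, 0.925, 0.935, 0.944, 0.951, 0.955, 0.959, 0.956],
    ),
    "l2": (
        [0.107, 0.876, 0.91, 0.923, 0.931, 0.936, 0.956, 0.958, 0.947, 0.947],
        [0.145, 0.861, 0.909, 0.919, 0.927, 0.939, 0.949, 0.954, 0.95, 0.953],
    ),
    "dropout": (
        [0.132, 0.869, 0.898, 0.917, 0.923, 0.928, 0.938, 0.94, 0.947, 0.944],
        [0.104, 0.859, 0.889, 0.906, 0.917, 0.925, 0.94, 0.942, 0.941, 0.947],
    ),
}


def mean_gap(train, test, skip=1):
    """Average of (train - test) accuracy, ignoring the first `skip` checkpoints."""
    gaps = [tr - te for tr, te in zip(train[skip:], test[skip:])]
    return sum(gaps) / len(gaps)


gaps = {name: mean_gap(train, test) for name, (train, test) in runs.items()}
for name, gap in gaps.items():
    print(f"{name}: mean train-test gap = {gap:.4f}")
```

On these logs the average gap is largest for the baseline run and smallest for the L2 run, with dropout in between.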

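Control group: two conv layers. Overfitting is not obvious to begin with, so the results are omitted.

A second way to write it: everything in one formula

There is no essential difference; it only skips the collection step, at the cost of longer and less readable code.

# lambd is the regularization coefficient (lambda is a reserved word in Python)
loss = tf.reduce_mean(tf.square(y_ - y)) + tf.contrib.layers.l2_regularizer(lambd)(w1) + tf.contrib.layers.l2_regularizer(lambd)(w2) + ...

Here is a test of the regularization ops on their own (the code that adds them to the loss is too long to list; just substitute them into the code above):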

import tensorflow as tf
CONST_SCALE = 0.5
w = tf.constant([[5.0, -2.0], [-3.0, 1.0]])
with tf.Session() as sess:
  print(sess.run(tf.abs(w)))
  print('preprocessing:', sess.run(tf.reduce_sum(tf.abs(w))))
  print('manual computation:', sess.run(tf.reduce_sum(tf.abs(w)) * CONST_SCALE))
  print('l1_regularizer:', sess.run(tf.contrib.layers.l1_regularizer(CONST_SCALE)(w))) #11 * CONST_SCALE

  print(sess.run(w**2))
  print(sess.run(tf.reduce_sum(w**2)))
  print('preprocessing:', sess.run(tf.reduce_sum(w**2) / 2))#default
  print('manual computation:', sess.run(tf.reduce_sum(w**2) / 2 * CONST_SCALE))
  print('l2_regularizer:', sess.run(tf.contrib.layers.l2_regularizer(CONST_SCALE)(w))) #19.5 * CONST_SCALE
-------------------------------
[[5. 2.]
 [3. 1.]]
preprocessing: 11.0
manual computation: 5.5
l1_regularizer: 5.5
[[25. 4.]
 [ 9. 1.]]
39.0
preprocessing: 19.5
manual computation: 9.75
l2_regularizer: 9.75

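Note: what the L2 regularizer computes before scaling is the sum of squares divided by 2. The factor is there for convenience, because differentiating w squared produces a factor of 2. Optimization proceeds the same way with or without the factor: minimizing a and minimizing 10a are the same training objective. If the ratio between the regularization term and the main loss needs to change, the decay coefficient can be tuned.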
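
The role of the factor 1/2 can be checked numerically: the gradient of sum(w**2) / 2 is w itself, with no stray factor of 2. The sketch below uses plain Python and a central finite difference; half_sum_of_squares and numerical_grad are illustrative helpers, not TensorFlow functions, and w is the matrix from the listing above, flattened.

```python
def half_sum_of_squares(w):
    """Same convention as tf.nn.l2_loss: sum of squares divided by 2."""
    return sum(x * x for x in w) / 2.0


def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


w = [5.0, -2.0, -3.0, 1.0]  # [[5, -2], [-3, 1]] flattened
value = half_sum_of_squares(w)
grad = numerical_grad(half_sum_of_squares, w)

print("penalty:", value)
print("gradient estimate:", [round(g, 4) for g in grad])
```

The penalty value matches the 19.5 printed in the output above, and the gradient estimate agrees with w up to rounding.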

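In a complex system, writing out the formula directly is less convenient than putting the base loss and the regularization terms into a collection, all the more so because you may want different decay coefficients for different weights, which gets tedious as a single formula.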
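
The collection pattern with a different decay coefficient per weight can be illustrated without TensorFlow. In the sketch below a plain Python list stands in for the 'losses' collection; register_weight, the weight values, the coefficients, and the data_loss value are all made up for illustration.

```python
losses = []  # stands in for the TensorFlow collection named 'losses'


def l2_loss(values):
    """Same convention as tf.nn.l2_loss: sum of squares divided by 2."""
    return sum(v * v for v in values) / 2.0


def register_weight(values, wd=None):
    """Record wd * l2_loss(values) when a decay coefficient is given."""
    if wd:
        losses.append(wd * l2_loss(values))
    return values


w_conv = register_weight([0.5, -0.5])           # no decay on this weight
w_fc1 = register_weight([1.0, 2.0], wd=0.004)   # contributes 0.004 * 2.5
w_fc2 = register_weight([3.0], wd=0.01)         # contributes 0.01 * 4.5

data_loss = 0.25  # placeholder for the cross_entropy value
losses.append(data_loss)

total_loss = sum(losses)  # plays the role of tf.add_n(tf.get_collection('losses'))
print("total_loss:", total_loss)
```

Each weight carries its own coefficient at the point where it is created, so the final sum never has to be rewritten when a layer is added or a coefficient changes.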

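Similar methods include batch normalization and dropout; these all have a noise-injecting effect and help prevent overfitting to some degree. Note, however, that L1 and L2 regularization should not be called the L1 norm and the L2 norm. A norm is a way of measuring distance, such as the sum of absolute values or the Euclidean length; it is not regularization. L1 regularization and L2 regularization can be understood as regularization that uses the L1 norm and the L2 norm.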
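
The distinction between a norm and a regularization term can be made concrete with the same matrix used in the listing above. This is a sketch in plain Python; the variable names are chosen here, and the 0.5 scale mirrors CONST_SCALE from that listing. Note that the L2 penalty is built from the squared L2 norm, not the norm itself.

```python
import math

w = [5.0, -2.0, -3.0, 1.0]  # [[5, -2], [-3, 1]] flattened
scale = 0.5                 # same value as CONST_SCALE in the listing above

# Norms: ways of measuring the size of w.
l1_norm = sum(abs(x) for x in w)
l2_norm = math.sqrt(sum(x * x for x in w))

# Regularization terms: scaled penalties built from those norms.
l1_reg = scale * l1_norm
l2_reg = scale * sum(x * x for x in w) / 2.0  # squared L2 norm, halved

print("L1 norm:", l1_norm, " L1 regularization term:", l1_reg)
print("L2 norm:", l2_norm, " L2 regularization term:", l2_reg)
```

The two regularization terms come out as 5.5 and 9.75, the same values that l1_regularizer and l2_regularizer print in the output above.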
