Machine learning preprocessing: comparing similar transformers in pyspark and sklearn (continuously updated)
Libraries
- sklearn: import sklearn.preprocessing
- pyspark: import pyspark.ml.feature
MinMaxScaler: scale each feature to [0, 1]
- Principle
$X_{scaled} = \frac{X - X.\min(axis=0)}{X.\max(axis=0) - X.\min(axis=0)} \cdot (\max - \min) + \min$
where (min, max) is the target feature_range, (0, 1) by default. Each column is scaled independently.
- sklearn.preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
demo:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

X, y = make_blobs(n_samples=40, centers=2, random_state=50, cluster_std=2)
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.cool)  # original data
plt.subplot(122)
X_2 = MinMaxScaler().fit_transform(X)  # each feature rescaled to [0, 1]
plt.scatter(X_2[:, 0], X_2[:, 1], c=y, cmap=plt.cm.cool)
plt.show()
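A quick numeric check of the formula above, using a small hypothetical array instead of the blobs:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [4.0, 40.0]])
# column 0: (2 - 1) / (4 - 1) = 1/3, so the result is [[0, 0], [1/3, 1/3], [1, 1]]
print(MinMaxScaler().fit_transform(X_small))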
- pyspark.ml.feature.MinMaxScaler
MinMaxScaler(self, min=0.0, max=1.0, inputCol=None, outputCol=None)
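A minimal sketch of the pyspark side, which operates on a vector column (an existing SparkSession named spark and the toy data are assumptions):
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import MinMaxScaler

df = spark.createDataFrame([(Vectors.dense([1.0, 10.0]),),
                            (Vectors.dense([2.0, 20.0]),),
                            (Vectors.dense([4.0, 40.0]),)], ["features"])
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features", outputCol="scaled")
model = scaler.fit(df)                    # learns per-column min and max
model.transform(df).show(truncate=False)  # each column rescaled to [0, 1]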
StandardScaler
Standardize the data
- Standardization
In machine learning we handle many kinds of data, for example audio signals and pixel values of images, and these data can be high-dimensional. After standardization, every feature has mean 0 (the feature's mean is subtracted from each value) and standard deviation 1. This method is widely used in many machine learning algorithms.
- sklearn.preprocessing.StandardScaler
StandardScaler(copy=True, with_mean=True, with_std=True)
Note:
If with_mean and with_std are both False, no centering or scaling is learned: the scaler acts as if μ = 0 and σ = 1, and the data is returned unchanged (which only makes sense if the features already follow a standard normal distribution).
If with_mean and with_std are both True, the actual μ and σ estimated from the data are used for centering and scaling; this is the most common setting, and the transformed features then have mean 0 and standard deviation 1.
The demo below runs through all four combinations.
demo:
import numpy as np
import sklearn.preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
# centering and scaling
scaler = sklearn.preprocessing.StandardScaler(with_mean=True, with_std=True).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
print("*" * 11)
# no centering, no scaling: the data passes through unchanged
scaler = sklearn.preprocessing.StandardScaler(with_std=False, with_mean=False).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
print("*" * 11)
# scaling only
scaler = sklearn.preprocessing.StandardScaler(with_std=True, with_mean=False).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
print("*" * 11)
# centering only
scaler = sklearn.preprocessing.StandardScaler(with_std=False, with_mean=True).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
Output:
{'with_mean': True, 'with_std': True, 'copy': True, 'n_samples_seen_': 3, 'mean_': array([1. , 0. , 0.33333333]), 'var_': array([0.66666667, 0.66666667, 1.55555556]), 'scale_': array([0.81649658, 0.81649658, 1.24721913])}
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
***********
{'with_mean': False, 'with_std': False, 'copy': True, 'n_samples_seen_': 3, 'mean_': None, 'var_': None, 'scale_': None}
[[ 1. -1. 2.]
[ 2. 0. 0.]
[ 0. 1. -1.]]
***********
{'with_mean': False, 'with_std': True, 'copy': True, 'n_samples_seen_': 3, 'mean_': array([1. , 0. , 0.33333333]), 'var_': array([0.66666667, 0.66666667, 1.55555556]), 'scale_': array([0.81649658, 0.81649658, 1.24721913])}
[[ 1.22474487 -1.22474487 1.60356745]
[ 2.44948974 0. 0. ]
[ 0. 1.22474487 -0.80178373]]
***********
{'with_mean': True, 'with_std': False, 'copy': True, 'n_samples_seen_': 3, 'mean_': array([1. , 0. , 0.33333333]), 'var_': None, 'scale_': None}
[[ 0. -1. 1.66666667]
[ 1. 0. -0.33333333]
[-1. 1. -1.33333333]]
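To connect the outputs with the fitted parameters: the first column has mean_ = 1 and scale_ = sqrt(2/3) ≈ 0.8165, so with centering and scaling the entry 2 becomes (2 - 1) / 0.8165 ≈ 1.2247, and with with_std=True, with_mean=False it becomes 2 / 0.8165 ≈ 2.4495, matching the second rows of the corresponding outputs above.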
- pyspark.ml.feature.StandardScaler
StandardScaler(withMean=False, withStd=True, inputCol=None, outputCol=None)
When withMean=True and withStd=False, the corresponding column mean is subtracted from every element of the vector. When both withMean and withStd are True, each element is first centered and then divided by the corresponding standard deviation. When withMean=True, only dense vectors can be processed, not sparse vectors.
demo
import pyspark.ml.feature
from pyspark.ml.linalg import Vectors

# `spark` is an existing SparkSession
df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])
# centering and scaling
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=True, withMean=True)
model = standardScaler.fit(df)
model.transform(df).show()
print("*" * 22)
# scaling only (the default)
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=True, withMean=False)
model = standardScaler.fit(df)
model.transform(df).show()
print("*" * 22)
# neither centering nor scaling: the vectors pass through unchanged
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=False, withMean=False)
model = standardScaler.fit(df)
model.transform(df).show()
print("*" * 22)
# centering only
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=False, withMean=True)
model = standardScaler.fit(df)
model.transform(df).show()
Result:
Binarizer: binarize values according to a threshold
Values less than or equal to the threshold are set to 0
Values greater than the threshold are set to 1
- sklearn.preprocessing.Binarizer
Note:
sklearn's Binarizer only handles 2D arrays
demo:
import numpy as np
import sklearn.preprocessing

x = np.array([[1, 2, 3.4], [2.1, 1.3, -10]])
# fit() is a no-op for Binarizer; it only validates the input
transformer = sklearn.preprocessing.Binarizer(threshold=2).fit(x)
print(transformer.transform(x))
binarizer = sklearn.preprocessing.Binarizer(threshold=2)
print(binarizer.fit_transform([[1, 2, 3, 4], [2, 3, 4, 5]]))
Output:
[[0. 0. 1.]
[1. 0. 0.]]
[[0 0 1 1]
[0 1 1 1]]
The difference between fit, transform and fit_transform (see the sketch below)
- fit: computes the statistics the transformer needs (mean, variance, min/max, ...) from the data, without changing the data itself
- transform: applies the pre-processing to the data, using one of the transformers from sklearn.preprocessing
- fit_transform(): the same as calling fit() and then transform() - a shortcut
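A minimal sketch of the three calls on a preprocessing transformer (the toy arrays X_train and X_test are hypothetical):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # learns mean_ and scale_ from X_train only
X_train_scaled = scaler.transform(X_train)  # applies the learned parameters
X_test_scaled = scaler.transform(X_test)    # reuses the training statistics
# fit_transform is the fit() + transform() shortcut on the same data
X_train_scaled_2 = StandardScaler().fit_transform(X_train)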
- pyspark.ml.feature.Binarizer
Binarizer(self, threshold=0.0, inputCol=None, outputCol=None)
demo
import pyspark.ml.feature

# `spark` is an existing SparkSession
df = spark.createDataFrame([(0.5,), (2.,), (3.,)], ['values'])
df.show()
binarizer = pyspark.ml.feature.Binarizer(threshold=2, inputCol='values', outputCol='features')
binarizer.transform(df).show()
"""
+------+
|values|
+------+
| 0.5|
| 2.0|
| 3.0|
+------+
+------+--------+
|values|features|
+------+--------+
| 0.5| 0.0|
| 2.0| 0.0|
| 3.0| 1.0|
+------+--------+
"""