Machine learning preprocessing: comparing similar transformers in pyspark and sklearn (continuously updated)
Libraries
- sklearn: import sklearn.preprocessing
- pyspark: import pyspark.ml.feature
MinMaxScaler: scale each feature to [0, 1]
- Principle
$X_{scaled} = \frac{X - X.\min(axis=0)}{X.\max(axis=0) - X.\min(axis=0)} \cdot (\max - \min) + \min$
where (min, max) is the target feature_range, (0, 1) by default. Each column is scaled independently.
- sklearn.preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
demo:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

X, y = make_blobs(n_samples=40, centers=2, random_state=50, cluster_std=2)
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.cool)  # original data
plt.subplot(122)
X_2 = MinMaxScaler().fit_transform(X)  # each feature rescaled to [0, 1]
plt.scatter(X_2[:, 0], X_2[:, 1], c=y, cmap=plt.cm.cool)
plt.show()
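A quick numeric check of the formula above, using a small hypothetical array instead of the blobs:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [4.0, 40.0]])
# column 0: (2 - 1) / (4 - 1) = 1/3, so the result is [[0, 0], [1/3, 1/3], [1, 1]]
print(MinMaxScaler().fit_transform(X_small))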
- pyspark.ml.feature.MinMaxScaler
MinMaxScaler(self, min=0.0, max=1.0, inputCol=None, outputCol=None)
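A minimal sketch of the pyspark side, which operates on a vector column (an existing SparkSession named spark and the toy data are assumptions):
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import MinMaxScaler

df = spark.createDataFrame([(Vectors.dense([1.0, 10.0]),),
                            (Vectors.dense([2.0, 20.0]),),
                            (Vectors.dense([4.0, 40.0]),)], ["features"])
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features", outputCol="scaled")
model = scaler.fit(df)                    # learns per-column min and max
model.transform(df).show(truncate=False)  # each column rescaled to [0, 1]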
StandardScaler
Standardize the data
- Standardization
In machine learning we handle many kinds of data, for example audio signals and pixel values of images, and these data can be high-dimensional. After standardization, every feature has mean 0 (the feature's mean is subtracted from each value) and standard deviation 1. This method is widely used in many machine learning algorithms.
- sklearn.preprocessing.StandardScaler
StandardScaler(copy=True, with_mean=True, with_std=True)
Note:
If with_mean and with_std are both False, no centering or scaling is learned: the scaler acts as if μ = 0 and σ = 1, and the data is returned unchanged (which only makes sense if the features already follow a standard normal distribution).
If with_mean and with_std are both True, the actual μ and σ estimated from the data are used for centering and scaling; this is the most common setting, and the transformed features then have mean 0 and standard deviation 1.
The demo below runs through all four combinations.
demo:
import numpy as np
import sklearn.preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
# centering and scaling
scaler = sklearn.preprocessing.StandardScaler(with_mean=True, with_std=True).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
print("*" * 11)
# no centering, no scaling: the data passes through unchanged
scaler = sklearn.preprocessing.StandardScaler(with_std=False, with_mean=False).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
print("*" * 11)
# scaling only
scaler = sklearn.preprocessing.StandardScaler(with_std=True, with_mean=False).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
print("*" * 11)
# centering only
scaler = sklearn.preprocessing.StandardScaler(with_std=False, with_mean=True).fit(X)
print(scaler.__dict__)
print(scaler.transform(X))
Output:
{'with_mean': True, 'with_std': True, 'copy': True, 'n_samples_seen_': 3, 'mean_': array([1. , 0. , 0.33333333]), 'var_': array([0.66666667, 0.66666667, 1.55555556]), 'scale_': array([0.81649658, 0.81649658, 1.24721913])}
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
***********
{'with_mean': False, 'with_std': False, 'copy': True, 'n_samples_seen_': 3, 'mean_': None, 'var_': None, 'scale_': None}
[[ 1. -1. 2.]
[ 2. 0. 0.]
[ 0. 1. -1.]]
***********
{'with_mean': False, 'with_std': True, 'copy': True, 'n_samples_seen_': 3, 'mean_': array([1. , 0. , 0.33333333]), 'var_': array([0.66666667, 0.66666667, 1.55555556]), 'scale_': array([0.81649658, 0.81649658, 1.24721913])}
[[ 1.22474487 -1.22474487 1.60356745]
[ 2.44948974 0. 0. ]
[ 0. 1.22474487 -0.80178373]]
***********
{'with_mean': True, 'with_std': False, 'copy': True, 'n_samples_seen_': 3, 'mean_': array([1. , 0. , 0.33333333]), 'var_': None, 'scale_': None}
[[ 0. -1. 1.66666667]
[ 1. 0. -0.33333333]
[-1. 1. -1.33333333]]
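To connect the outputs with the fitted parameters: the first column has mean_ = 1 and scale_ = sqrt(2/3) ≈ 0.8165, so with centering and scaling the entry 2 becomes (2 - 1) / 0.8165 ≈ 1.2247, and with with_std=True, with_mean=False it becomes 2 / 0.8165 ≈ 2.4495, matching the second rows of the corresponding outputs above.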
- pyspark.ml.feature.StandardScaler
StandardScaler(withMean=False, withStd=True, inputCol=None, outputCol=None)
When withMean=True and withStd=False, the corresponding column mean is subtracted from every element of the vector. When both withMean and withStd are True, each element is first centered and then divided by the corresponding standard deviation. When withMean=True, only dense vectors can be processed, not sparse vectors.
demo
import pyspark.ml.feature
from pyspark.ml.linalg import Vectors

# `spark` is an existing SparkSession
df = spark.createDataFrame([(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["a"])
# centering and scaling
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=True, withMean=True)
model = standardScaler.fit(df)
model.transform(df).show()
print("*" * 22)
# scaling only (the default)
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=True, withMean=False)
model = standardScaler.fit(df)
model.transform(df).show()
print("*" * 22)
# neither centering nor scaling: the vectors pass through unchanged
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=False, withMean=False)
model = standardScaler.fit(df)
model.transform(df).show()
print("*" * 22)
# centering only
standardScaler = pyspark.ml.feature.StandardScaler(inputCol="a", outputCol="scaled", withStd=False, withMean=True)
model = standardScaler.fit(df)
model.transform(df).show()
Result:
Binarizer: binarize values according to a threshold
Values less than or equal to the threshold are set to 0
Values greater than the threshold are set to 1
- sklearn.preprocessing.Binarizer
Note:
sklearn's Binarizer only handles 2D arrays
demo:
import numpy as np
import sklearn.preprocessing

x = np.array([[1, 2, 3.4], [2.1, 1.3, -10]])
# fit() is a no-op for Binarizer; it only validates the input
transformer = sklearn.preprocessing.Binarizer(threshold=2).fit(x)
print(transformer.transform(x))
binarizer = sklearn.preprocessing.Binarizer(threshold=2)
print(binarizer.fit_transform([[1, 2, 3, 4], [2, 3, 4, 5]]))
Output:
[[0. 0. 1.]
[1. 0. 0.]]
[[0 0 1 1]
[0 1 1 1]]
The difference between fit, transform and fit_transform (see the sketch below)
- fit: computes the statistics the transformer needs (mean, variance, min/max, ...) from the data, without changing the data itself
- transform: applies the pre-processing to the data, using one of the transformers from sklearn.preprocessing
- fit_transform(): the same as calling fit() and then transform() - a shortcut
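A minimal sketch of the three calls on a preprocessing transformer (the toy arrays X_train and X_test are hypothetical):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # learns mean_ and scale_ from X_train only
X_train_scaled = scaler.transform(X_train)  # applies the learned parameters
X_test_scaled = scaler.transform(X_test)    # reuses the training statistics
# fit_transform is the fit() + transform() shortcut on the same data
X_train_scaled_2 = StandardScaler().fit_transform(X_train)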
- pyspark.ml.feature.Binarizer
Binarizer(self, threshold=0.0, inputCol=None, outputCol=None)
demo
import pyspark.ml.feature

# `spark` is an existing SparkSession
df = spark.createDataFrame([(0.5,), (2.,), (3.,)], ['values'])
df.show()
binarizer = pyspark.ml.feature.Binarizer(threshold=2, inputCol='values', outputCol='features')
binarizer.transform(df).show()
"""
+------+
|values|
+------+
| 0.5|
| 2.0|
| 3.0|
+------+
+------+--------+
|values|features|
+------+--------+
| 0.5| 0.0|
| 2.0| 0.0|
| 3.0| 1.0|
+------+--------+
"""