Machine Learning Basics (6): Logistic Regression
Logistic Regression
1. Basic concepts
Logistic regression is used for binary classification. Its output is:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
$$g(z) = \frac{1}{1 + e^{-z}}$$
where $z = \theta^T x$ is the output of the linear regression.
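As a small illustration of the two formulas above, here is a minimal NumPy sketch. The names `sigmoid` and `predict_proba` and the parameter values are made up for the example and do not come from the original post.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); maps any real score to a value in (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """h_theta(x) = g(theta^T x), evaluated for every row x of X."""
    return sigmoid(X @ theta)

# Hypothetical parameters: theta[0] is the bias, theta[1] the single weight.
theta = np.array([0.0, 1.0])
# The first column is the constant 1 so that theta[0] acts as the bias.
X = np.array([[1.0, -2.0],
              [1.0, 0.0],
              [1.0, 2.0]])

probs = predict_proba(theta, X)
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 to get a class
print(probs)
print(labels)
```

Thresholding the probability at 0.5 is the usual convention for turning the output into a class label; other thresholds are possible when one kind of error is more costly than the other.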
1.1 Log-likelihood loss function
$$\mathrm{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & y = 1 \\ -\log(1 - h_\theta(x)) & y = 0 \end{cases}$$
1.2 The complete loss function
$$\mathrm{cost}(h_\theta(x), y) = \sum_{i=1}^{m} \left[ -y_i \log(h_\theta(x_i)) - (1 - y_i) \log(1 - h_\theta(x_i)) \right]$$
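The summed loss can be checked numerically with a few lines of NumPy. This is a sketch, not code from the original post: the function name `log_loss_sum`, the clipping constant `eps`, and the example probabilities are all assumptions made for illustration.

```python
import numpy as np

def log_loss_sum(y, p, eps=1e-12):
    """Sum over samples of -y*log(p) - (1-y)*log(1-p).

    p is clipped away from 0 and 1 so that log() never receives 0.
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p)))

y = np.array([1, 0, 1])
p_good = np.array([0.9, 0.1, 0.8])  # confident and correct predictions
p_bad = np.array([0.1, 0.9, 0.2])   # confident and wrong predictions

loss_good = log_loss_sum(y, p_good)
loss_bad = log_loss_sum(y, p_bad)
print(loss_good, loss_bad)
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is the behavior the piecewise definition in 1.1 describes.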
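For comparison, linear regression uses the mean squared error as its loss. That loss is convex in θ, so it has a single minimum rather than several local minima.
The log-likelihood loss is often described as having multiple local minima. For plain logistic regression this is not the case: the loss above is also convex in θ, so any local minimum is a global one. Local minima become a real concern when the same loss is applied to non-convex models such as neural networks, where no general method for finding the global optimum is known.
In that non-convex setting, two practical remedies are:
(1) At the start of training, initialize randomly several times and compare the results.
(2) During training, adjust the learning rate.
Even when the global optimum cannot be guaranteed, the results are usually good in practice.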
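The "initialize several times and compare" idea can be sketched with plain gradient descent on the averaged log-likelihood loss. Everything here is an assumption made for illustration: the synthetic data, `true_theta`, the learning rate `lr`, and the step count are not from the original post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(theta, X, y, eps=1e-12):
    """Average log-likelihood loss over all samples."""
    p = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

def gradient_descent(X, y, theta0, lr=0.5, n_steps=2000):
    """Batch gradient descent; the gradient of the mean loss is X^T (h - y) / m."""
    theta = theta0.copy()
    m = len(y)
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= lr * grad
    return theta

# Synthetic, noisy (non-separable) data; the first column is the bias term.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_theta = np.array([0.5, 2.0, -1.0])  # hypothetical generating parameters
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

# Run from several random starting points and compare the final losses.
final_losses = []
for seed in range(3):
    theta0 = np.random.default_rng(seed).normal(scale=3.0, size=3)
    theta = gradient_descent(X, y, theta0)
    final_losses.append(mean_log_loss(theta, X, y))
print(final_losses)
```

On this toy data the runs should end at nearly the same loss, which is what one expects from a convex objective. For a non-convex model the final losses can differ, and the run with the lowest loss would be kept.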
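The log-likelihood loss has a form similar to information entropy (it is the cross-entropy between the true labels and the predicted probabilities), so it reflects how uncertain the predictions are.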
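2. Logistic regression API
sklearn.linear_model.LogisticRegression
sklearn.linear_model.LogisticRegression(penalty='l2', C=1.0)
penalty: the type of regularization ('l2' by default).
C: the inverse of the regularization strength; smaller values mean stronger regularization.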
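To see what `C` does, the following sketch fits the model on synthetic data at three values of `C` and compares the size of the learned coefficients. The data from `make_classification`, the chosen `C` values, and `max_iter=1000` are assumptions for the example; the penalty is left at its default, which is L2.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

coef_norms = []
for C in (0.01, 1.0, 100.0):
    # Smaller C means a stronger L2 penalty on the coefficients.
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(C, coef_norms[-1])
```

The coefficient norm should grow as `C` increases, because a weaker penalty lets the weights move further from zero.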
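3. LogisticRegression case study
"Predicting benign vs. malignant breast cancer tumors"
Raw data download location: https://archive.ics.uci.edu/ml/machine-learning-databases/
Data description
(1) 699 samples with 11 columns: the first column is an id used for lookup, the next 9 columns are tumor-related medical features, and the last column is a numeric code for the tumor type (2 = benign, 4 = malignant).
(2) The data contains 16 missing values, marked with "?".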
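One way to handle the "?" markers is to let pandas treat them as missing at read time through the `na_values` argument of `read_csv`. The sketch below works offline on three inline rows: the first two are the first rows of the real file, and the third row (id 9999999) is made up to show a missing value.

```python
import io
import pandas as pd

columns = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
           'Uniformity of Cell Shape', 'Marginal Adhesion',
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
           'Normal Nucleoli', 'Mitoses', 'Class']

sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "9999999,8,4,5,1,2,?,7,3,1,4\n"  # made-up row with a missing value
)

# na_values="?" converts the marker to NaN while parsing,
# so 'Bare Nuclei' is read as a numeric column instead of strings.
df = pd.read_csv(sample, names=columns, na_values="?")
n_missing = int(df['Bare Nuclei'].isna().sum())
clean = df.dropna()
print(n_missing, len(clean), df['Bare Nuclei'].dtype)
```

Dropping rows is reasonable here because only 16 of 699 samples are affected; with a larger share of missing values, imputation might be a better choice.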
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np
# Build the column names
column = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
          'Uniformity of Cell Shape', 'Marginal Adhesion',
          'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
          'Normal Nucleoli', 'Mitoses', 'Class']
# Read the data
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=column)
data.head()
| Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sample code number 699 non-null int64
1 Clump Thickness 699 non-null int64
2 Uniformity of Cell Size 699 non-null int64
3 Uniformity of Cell Shape 699 non-null int64
4 Marginal Adhesion 699 non-null int64
5 Single Epithelial Cell Size 699 non-null int64
6 Bare Nuclei 699 non-null object
7 Bland Chromatin 699 non-null int64
8 Normal Nucleoli 699 non-null int64
9 Mitoses 699 non-null int64
10 Class 699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.2+ KB
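# Handle missing values: turn "?" into NaN, then drop the affected rows
data = data.replace(to_replace='?', value=np.nan)
data = data.dropna()
# Split the data (no random_state is set, so the split and the scores vary between runs)
x_train, x_test, y_train, y_test = train_test_split(data[column[1:10]], data[column[10]], test_size=0.25)
# Standardize the features; fit on the training set only
std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
# Fit the logistic regression model
lg = LogisticRegression(C=1.0)
lg.fit(x_train, y_train)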
# Inspect the regression coefficients
print(lg.coef_)
y_predict = lg.predict(x_test)
print("Accuracy:", lg.score(x_test, y_test))
# Classes are coded 2 (benign) and 4 (malignant); the report includes precision, recall and F1
print("Classification report:\n", classification_report(y_test, y_predict, labels=[2, 4], target_names=["benign", "malignant"]))
[[ 1.1889528 0.11934019 0.74802964 0.9608045 -0.14967373 1.55680317
0.78075779 0.86709826 0.68220413]]
Accuracy: 0.9707602339181286
Classification report:
               precision    recall  f1-score   support
       benign       0.96      0.99      0.98       111
    malignant       0.98      0.93      0.96        60
     accuracy                           0.97       171
    macro avg       0.97      0.96      0.97       171
 weighted avg       0.97      0.97      0.97       171
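For tumor screening, the recall of the malignant class is usually the number to watch, since a missed malignant case is the costly error. The sketch below computes it directly from a confusion matrix. So that it runs offline, it uses the breast cancer dataset bundled with scikit-learn (`load_breast_cancer`), which is a different dataset from the UCI file used above: 569 samples, 30 features, with class 0 = malignant and 1 = benign. The `random_state` and `stratify` settings are assumptions added for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# The pipeline fits the scaler on the training data only, as in the case study.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
# Recall for malignant = correctly predicted malignant / all true malignant.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
accuracy = model.score(X_test, y_test)
print(cm)
print("Malignant recall:", malignant_recall)
print("Accuracy:", accuracy)
```

Wrapping the scaler and the classifier in one pipeline keeps the test data out of the standardization statistics without having to call `fit_transform` and `transform` by hand.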