Machine Learning Basics (6): Logistic Regression


Bayesian小孙 · Published 2022-07-11 17:28:24


Logistic Regression

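1. Basic concepts

Logistic regression is used for binary classification. Its output is:

$h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$

$g(z)=\frac{1}{1+e^{-z}}$

where $z=\theta^Tx$ is the output of the linear regression, so $h_\theta(x)$ lies in $(0,1)$ and is read as the probability that $y=1$.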
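The two formulas above can be sketched in a few lines of NumPy. The parameter vector, the sample values and the 0.5 decision threshold are assumptions made for illustration; none of them come from the original post.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), applied to each row of x."""
    return sigmoid(x @ theta)

# Hypothetical parameters and samples, chosen only for illustration.
theta = np.array([0.5, -1.0])
x = np.array([[2.0, 1.0],    # theta^T x = 0.0
              [4.0, 0.0]])   # theta^T x = 2.0

probs = hypothesis(theta, x)
# Assumed convention: predict class 1 when the probability is at least 0.5.
labels = (probs >= 0.5).astype(int)
print(probs)
print(labels)
```

A linear output of 0 maps to a probability of exactly 0.5, and larger outputs move the probability toward 1.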

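1.1 Log-likelihood loss function

For a single sample, the loss depends on the true label $y$:

$cost(h_\theta(x),y)= \left\{ \begin{array}{lc} -\log(h_\theta(x)) & y=1 \\ -\log(1-h_\theta(x)) & y=0 \\ \end{array} \right.$

1.2 The complete loss function

Summing over all $m$ samples $(x_i,y_i)$ merges the two cases into one expression:

$cost(h_\theta(x),y)=\sum\limits_{i=1}^m[-y_i\log(h_\theta(x_i))-(1-y_i)\log(1-h_\theta(x_i))]$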
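As a rough numerical check of the summed loss, the sketch below evaluates it for two sets of predicted probabilities. The labels and probabilities are made up, and the clipping constant `eps` is an addition to guard against `log(0)`; it is not part of the formula.

```python
import numpy as np

def log_loss_sum(y, h, eps=1e-12):
    """Sum over samples of -y*log(h) - (1-y)*log(1-h)."""
    h = np.clip(h, eps, 1.0 - eps)  # assumption: clip to avoid log(0)
    return float(np.sum(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h)))

# Hypothetical labels and predicted probabilities.
y = np.array([1, 0, 1, 0])
h_good = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
h_bad = np.array([0.1, 0.9, 0.2, 0.8])   # confident and wrong

loss_good = log_loss_sum(y, h_good)
loss_bad = log_loss_sum(y, h_bad)
print(loss_good, loss_bad)
```

Confident wrong predictions are penalized far more heavily than confident correct ones, which is the behavior the piecewise definition in 1.1 describes.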

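For comparison, linear regression uses the mean squared error as its loss. That loss is convex in $\theta$, so it has a single minimum and no other local minima.

The log-likelihood loss of logistic regression is also convex in $\theta$, so it has no extra local minima either. Multiple local minima become a concern when the same loss is used with non-convex models such as neural networks, where no general method for finding the global optimum is known.

Two practices help in that setting, and the second also affects how fast gradient descent converges for logistic regression:

(1) At the start of optimization, randomly initialize several times and compare the results.

(2) During optimization, adjust the learning rate.

Even when the global optimum is hard to find, the results are usually good in practice.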
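The two practices above can be sketched with plain batch gradient descent on the loss from section 1.2. The synthetic data, the learning rate of 0.5 and the iteration count are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, y):
    """Summed log-likelihood loss from section 1.2."""
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return float(np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)))

def gradient_descent(X, y, theta0, lr, n_iter=2000):
    theta = theta0.copy()
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y)  # gradient of the summed loss
        theta -= lr * grad / len(y)            # step on the per-sample average
    return theta

# Hypothetical, overlapping two-class data with an intercept column.
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(-1, 1.5, size=(50, 2)),
                   rng.normal(1, 1.5, size=(50, 2))])
X = np.hstack([np.ones((100, 1)), X_raw])
y = np.concatenate([np.zeros(50), np.ones(50)])

# (1) several random initializations, (2) an explicit learning rate.
final_losses = []
for seed in range(5):
    theta0 = np.random.default_rng(seed).normal(size=3)
    theta = gradient_descent(X, y, theta0, lr=0.5)
    final_losses.append(loss(theta, X, y))
print(final_losses)
```

If the printed losses agree, the runs reached the same minimum. If they differ, the run with the smallest loss is the one to keep, and a smaller learning rate or more iterations may be worth trying.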

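The log-likelihood loss has a structure similar to information entropy, which measures how uncertain a piece of information is. More precisely, it has the form of a cross-entropy between the true label and the predicted probability.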
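To make the resemblance concrete, the two expressions can be put side by side. The symbols $H$ and $p$ are introduced here for illustration and do not appear in the original post.

$H(p)=-p\log p-(1-p)\log(1-p)$

$cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$

The first line is the entropy of a binary variable that equals 1 with probability $p$. The second is the per-sample loss: it keeps the logarithms of the predicted probability $h_\theta(x)$ but weights them by the true label $y$ instead of by the probability itself.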

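2. Logistic regression API

sklearn.linear_model.LogisticRegression

sklearn.linear_model.LogisticRegression(penalty='l2', C=1.0)

Here penalty selects the type of regularization and C is the inverse of the regularization strength, so a smaller C means a stronger penalty.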
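A minimal usage sketch of the estimator follows, assuming synthetic data from `make_classification` and two arbitrary values of `C`. The penalty is left at its default, which is the `'l2'` shown in the signature above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only to exercise the API.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# C is the inverse of the regularization strength:
# a smaller C shrinks the coefficients more.
strong = LogisticRegression(C=0.01).fit(X, y)
weak = LogisticRegression(C=100.0).fit(X, y)

strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print(strong_norm, weak_norm)

# predict_proba returns one column per class, ordered as in classes_.
proba = weak.predict_proba(X[:3])
print(weak.classes_, proba)
```

Comparing the two coefficient norms shows the effect of `C`. `predict_proba` exposes the probabilities that `predict` thresholds into class labels.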

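3. LogisticRegression case study

"Predicting benign / malignant breast cancer tumors"

Download location of the original data: https://archive.ics.uci.edu/ml/machine-learning-databases/

Data description

(1) 699 samples with 11 columns. The first column is an id used for lookup, the next 9 columns are medical features related to the tumor, and the last column is a numeric code for the tumor type.

(2) The data contains 16 missing values, marked with "?".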

# Only the names used below are imported; load_boston is no longer
# available in recent scikit-learn releases and was not used here.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

# Build the column names (the raw file has no header row)
column = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
          'Uniformity of Cell Shape', 'Marginal Adhesion',
          'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
          'Normal Nucleoli', 'Mitoses', 'Class']

# Read the data
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=column)

data.head()

	Sample code number	Clump Thickness	Uniformity of Cell Size	Uniformity of Cell Shape	Marginal Adhesion	Single Epithelial Cell Size	Bare Nuclei	Bland Chromatin	Normal Nucleoli	Mitoses	Class
0	1000025	5	1	1	1	2	1	3	1	1	2
1	1002945	5	4	4	5	7	10	3	2	1	2
2	1015425	3	1	1	1	2	2	3	1	1	2
3	1016277	6	8	8	1	3	4	3	7	1	2
4	1017023	4	1	1	3	2	1	3	1	1	2

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Sample code number           699 non-null    int64 
 1   Clump Thickness              699 non-null    int64 
 2   Uniformity of Cell Size      699 non-null    int64 
 3   Uniformity of Cell Shape     699 non-null    int64 
 4   Marginal Adhesion            699 non-null    int64 
 5   Single Epithelial Cell Size  699 non-null    int64 
 6   Bare Nuclei                  699 non-null    object
 7   Bland Chromatin              699 non-null    int64 
 8   Normal Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB

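# Handle missing values: turn '?' into NaN, then drop the affected rows
data = data.replace(to_replace='?', value=np.nan)

data = data.dropna()

# 'Bare Nuclei' was read as strings because of the '?' markers; make it numeric
data = data.astype({'Bare Nuclei': int})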
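The cleaning step can be tried on a small frame without downloading anything. The toy values below are made up; only the column names and the "?" marker follow the dataset described above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the real data: one row has a missing value.
toy = pd.DataFrame({"Bare Nuclei": ["1", "10", "?", "4"],
                    "Class": [2, 4, 2, 4]})

# Count the '?' markers before touching the data.
n_missing = int((toy["Bare Nuclei"] == "?").sum())

# Same recipe as above: '?' -> NaN, drop those rows, then convert to integers.
cleaned = toy.replace(to_replace="?", value=np.nan).dropna()
cleaned = cleaned.astype({"Bare Nuclei": int})

print(n_missing, len(cleaned))
print(cleaned["Bare Nuclei"].tolist())
```

Applied to the real file, the same recipe removes the 16 rows with missing values and leaves 683 of the 699 samples.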

# Split the data: columns 1-9 are the features, column 10 is the label
x_train, x_test, y_train, y_test = train_test_split(data[column[1:10]], data[column[10]], test_size=0.25)

# Standardize the features (fit on the training set only)
std = StandardScaler()

x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
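Calling `fit_transform` on the training set and only `transform` on the test set means the test data is scaled with the training mean and standard deviation. A small sketch with made-up numbers shows what that does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values, two features on very different scales.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same computation by hand: (x - mean) / std, with the population std.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(test_scaled)
print(manual)
```

Fitting the scaler on the test set as well would leak information about the test data into preprocessing, which is why the original code uses `transform` there.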

# Fit the logistic regression model
lg = LogisticRegression(C=1.0)

lg.fit(x_train, y_train)

# Inspect the regression coefficients (one per feature)
print(lg.coef_) 

y_predict = lg.predict(x_test)
print("Accuracy:", lg.score(x_test, y_test))

# The report lists precision, recall and F1 per class (2 = benign, 4 = malignant)
print("Classification report:", classification_report(y_test, y_predict, labels=[2, 4], target_names=["benign", "malignant"]))

[[ 1.1889528   0.11934019  0.74802964  0.9608045  -0.14967373  1.55680317
   0.78075779  0.86709826  0.68220413]]
Accuracy: 0.9707602339181286
Classification report:               precision    recall  f1-score   support

      benign       0.96      0.99      0.98       111
   malignant       0.98      0.93      0.96        60

    accuracy                           0.97       171
   macro avg       0.97      0.96      0.97       171
weighted avg       0.97      0.97      0.97       171
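The report rounds to two decimals, but together with the exact accuracy it is consistent with only one set of counts: 110 of 111 benign and 56 of 60 malignant test samples classified correctly. These counts are inferred here from the printed numbers; the original run did not print them. Under that assumption, the metrics for the malignant class can be recomputed by hand:

```python
# Counts inferred from the rounded report above (an assumption, not printed output).
tn, fp = 110, 1   # benign samples: predicted benign / predicted malignant
fn, tp = 4, 56    # malignant samples: predicted benign / predicted malignant

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_malignant = tp / (tp + fp)   # of predicted malignant, how many are
recall_malignant = tp / (tp + fn)      # of actual malignant, how many are found
f1_malignant = (2 * precision_malignant * recall_malignant
                / (precision_malignant + recall_malignant))

print(round(accuracy, 4))
print(round(precision_malignant, 2), round(recall_malignant, 2), round(f1_malignant, 2))
```

For a tumor screening task, recall on the malignant class is usually the number to watch, since a missed malignant case is more costly than a false alarm.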