Predicting Credit Default with LogisticRegression and LightGBM, Based on Kaggle Competition Data


I_am_a_big_tiger · Published 2023-06-07 18:11:37

I am still learning and feeling my way through this, so the article is bound to contain mistakes. Corrections are welcome.

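Overall approach to the prediction task:

This article builds a LogisticRegression model to predict default in a credit business. The model places the following main requirements on its variables:

1. The variables should not show strong pairwise linear correlation or multicollinearity.

2. The variables should be significant: the smaller the p-value of a variable's coefficient, the better.

3. The variables should have a sensible business meaning and agree with risk-control logic, which is judged from the sign of the coefficient.

4. Missing values and outliers have a large effect on the variables, so they must be handled before modeling.

With these requirements in mind, the data is prepared as follows: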

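(1) Data preprocessing

1. Read the data dictionary that comes with the data set, look at the overall characteristics of the data, and drop variables that carry no business meaning.

2. Separate numeric variables from categorical ones. Some numeric variables contain irregularly entered values.

3. Clean up the irregular input with SQL.

4. Check with box plots whether the distribution of each numeric variable is reasonable, and handle the outliers.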

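(2) Correlation analysis

1. Linear correlation: compute Pearson correlation coefficients; for feature pairs whose correlation exceeds 0.6, check the p-value with pearsonr.

2. Use stepwise regression to examine collinearity between features and to filter out the variables that cause multicollinearity. This is the method this article uses for feature selection.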

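(3) Binning, WOE encoding, and IV calculation

1. Binning: binning discretizes a continuous feature by cutting it into segments, turning it into a set of discrete intervals. Discretized features are robust to outliers, which lowers the risk of overfitting. First find suitable bin edges by checking that WOE is monotonic across bins; once the edges are fixed, bin the variable, encode it, and compute its IV.

2. Screen the features once more by their IV.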
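For reference, the WOE and IV used here can be written as below. The notation is my own and is not taken from any library: B_i and G_i are the numbers of bad and good samples in bin i, and B and G are the totals over all bins. This is meant to match what the binning functions later in the article compute (badattr divided by goodattr, then the log).

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{B_i / B}{G_i / G}\right),
\qquad
\mathrm{IV} = \sum_{i} \left(\frac{B_i}{B} - \frac{G_i}{G}\right)\mathrm{WOE}_i
```

With this sign convention, a bin with positive WOE holds a larger share of the bad samples than of the good ones. Each term of the IV sum is non-negative, because the difference and the log always have the same sign.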

(4) Model training and evaluation

1. Split the data into a training set and a test set.

2. Evaluate the model with AUC and KS.

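I. Overview of the data

1. Reading the data: there are 28 features (columns) and 100,000 rows. One customer can have several business records, and the goal is to predict default for each record.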

import pandas as pd
import numpy as np
data = pd.read_csv(r"D:\new_job\KAGGLE\kaggle\train.csv")
print(data.columns)
print(data.shape)

Index(['ID', 'Customer_ID', 'Month', 'Name', 'Age', 'SSN', 'Occupation',
'Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts',
'Num_Credit_Card', 'Interest_Rate', 'Num_of_Loan', 'Type_of_Loan',
'Delay_from_due_date', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit',
'Num_Credit_Inquiries', 'Credit_Mix', 'Outstanding_Debt',
'Credit_Utilization_Ratio', 'Credit_History_Age',
'Payment_of_Min_Amount', 'Total_EMI_per_month',
'Amount_invested_monthly', 'Payment_Behaviour', 'Monthly_Balance',
'Credit_Score'],dtype='object')

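2. Inspect the basic information. Some variables contain null values. Drop the features that are of no use: Name and SSN (social security number).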

print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 100000 non-null object
1 Customer_ID 100000 non-null object
2 Month 100000 non-null object
3 Name 90015 non-null object
4 Age 100000 non-null object
5 SSN 100000 non-null object
6 Occupation 100000 non-null object
7 Annual_Income 100000 non-null object
8 Monthly_Inhand_Salary 84998 non-null float64
9 Num_Bank_Accounts 100000 non-null int64
10 Num_Credit_Card 100000 non-null int64
11 Interest_Rate 100000 non-null int64
12 Num_of_Loan 100000 non-null object
13 Type_of_Loan 88592 non-null object
14 Delay_from_due_date 100000 non-null int64
15 Num_of_Delayed_Payment 92998 non-null object
16 Changed_Credit_Limit 100000 non-null object
17 Num_Credit_Inquiries 98035 non-null float64
18 Credit_Mix 100000 non-null object
19 Outstanding_Debt 100000 non-null object
20 Credit_Utilization_Ratio 100000 non-null float64
21 Credit_History_Age 90970 non-null object
22 Payment_of_Min_Amount 100000 non-null object
23 Total_EMI_per_month 100000 non-null float64
24 Amount_invested_monthly 95521 non-null object
25 Payment_Behaviour 100000 non-null object
26 Monthly_Balance 98800 non-null object
27 Credit_Score 100000 non-null object
dtypes: float64(4), int64(4), object(20)
memory usage: 21.4+ MB

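3. Separate the numeric variables from the categorical ones. Some variables that should be numeric end up classified as categorical because of irregular input, for example values that contain stray characters (28_, _10000_, and so on). This affects, among others, 'Age', 'Annual_Income', 'Num_of_Loan', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit' and 'Monthly_Balance'.

Nu_feature = list(data.select_dtypes(exclude=['object']).columns)
print('Numeric variables:',Nu_feature)
Ca_feature = list(data.select_dtypes(include=['object']).columns)
print('Categorical variables:',Ca_feature)

Numeric variables: ['Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate', 'Delay_from_due_date', 'Num_Credit_Inquiries', 'Credit_Utilization_Ratio', 'Total_EMI_per_month']
Categorical variables: ['ID', 'Customer_ID', 'Month', 'Name', 'Age', 'SSN', 'Occupation', 'Annual_Income', 'Num_of_Loan', 'Type_of_Loan', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Credit_Mix', 'Outstanding_Debt', 'Credit_History_Age', 'Payment_of_Min_Amount', 'Amount_invested_monthly', 'Payment_Behaviour', 'Monthly_Balance', 'Credit_Score']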

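4. Preprocessing plan: first clean all the irregular input in one pass. Out of personal habit I do this part in SQL: strip the stray characters from Age, Annual_Income, Num_of_Delayed_Payment, and Outstanding_Debt; reduce Credit_History_Age to whole years; and encode Credit_Score and Credit_Mix as 0/1.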

select *,
       -- strip the stray underscores from the numeric fields
       replace(Age,'_','') Age_1,
       replace(Annual_Income,'_','') Annual_Income_1,
       replace(Num_of_Delayed_Payment,'_','') Num_of_Delayed_Payment_1,
       replace(Changed_Credit_Limit,'_','') Changed_Credit_Limit_1,
       replace(Outstanding_Debt,'_','') Outstanding_Debt_1,
       -- keep only the leading years part of Credit_History_Age
       left(Credit_History_Age,2) Credit_History_Age_1,
       -- 0/1 encoding: Bad = 1, Good/Standard = 0, anything else becomes null
       case when Credit_Mix='Good' or Credit_Mix='Standard' then 0
            when Credit_Mix='Bad' then 1 else null end as Credit_Mix_1,
       -- number of loans = number of commas in Type_of_Loan + 1
       char_length(Type_of_Loan)- char_length(replace(Type_of_Loan,',','')) + 1 Num_of_Loan_1,
       case when Payment_of_Min_Amount='NO' or Payment_of_Min_Amount='NM' then 0
            when Payment_of_Min_Amount='Yes' then 1 end as Payment_of_Min_Amount_1,
       replace(Amount_invested_monthly,'__10000__','') Amount_invested_monthly_1,
       -- target: Standard/Good = 0, everything else = 1
       case when Credit_Score='Standard' or Credit_Score='Good' then 0 else 1
       end as Credit_Score_1
from train;
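For readers who would rather stay in pandas, the same cleaning idea can be sketched roughly as follows. This is not the code I actually ran (I used the SQL above). The toy rows, the helper name strip_underscores, and the assumed "N Years and M Months" layout of Credit_History_Age are all illustrative assumptions.

```python
import pandas as pd

# Toy rows imitating the irregular inputs described above (all values are made up).
raw = pd.DataFrame({
    "Age": ["28_", "34", "-500"],
    "Annual_Income": ["19114.12", "34847.84_", "143162.64"],
    "Credit_History_Age": ["22 Years and 1 Months", None, "9 Years and 11 Months"],
    "Type_of_Loan": ["Auto Loan, Credit-Builder Loan", None, "Personal Loan"],
})

def strip_underscores(series: pd.Series) -> pd.Series:
    """Hypothetical helper: drop '_' characters, then parse as numbers (unparseable -> NaN)."""
    return pd.to_numeric(series.str.replace("_", "", regex=False), errors="coerce")

clean = pd.DataFrame({
    "Age": strip_underscores(raw["Age"]),
    "Annual_Income": strip_underscores(raw["Annual_Income"]),
    # The leading digits are taken as the number of years (assumed text layout).
    "Credit_History_Age": pd.to_numeric(
        raw["Credit_History_Age"].str.extract(r"^(\d+)", expand=False), errors="coerce"
    ),
    # Number of loans = number of commas + 1; stays NaN when Type_of_Loan is missing.
    "Num_of_Loan": raw["Type_of_Loan"].str.count(",") + 1,
})
print(clean)
```

Unlike left(Credit_History_Age,2), the regular expression also copes with single-digit years, since it does not pick up the trailing space.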

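5. Look at summary statistics for the cleaned numeric features and draw box plots to analyze their distributions and plausibility. Box plots give a reasonable read on outliers. Age, for example, contains negative values and even values in the thousands. Annual and monthly income can be read together, and both contain abnormally large values. The number of bank accounts and the number of credit cards contain implausibly large values, and the interest rate contains outliers. Days past due shows extreme values in its distribution, but these are reasonable from a business point of view.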

## Summary statistics for the numeric variables
for col in Nu_feature:
    print(col,':',data[col].describe())
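The box-plot reading described above can also be approximated numerically. The sketch below computes the whisker limits that a standard box plot draws (1.5 times the interquartile range beyond the quartiles) and lists the points outside them. The function name iqr_fences and the sample ages are made up for illustration; in this article the actual cut-offs were chosen by eye and by business sense, not by this rule.

```python
import pandas as pd

def iqr_fences(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) whisker limits of a standard box plot."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages, including one negative value and one in the thousands.
ages = pd.Series([23, 28, 31, 34, 45, 52, -500, 7580], name="Age")
low, high = iqr_fences(ages)
outliers = ages[(ages < low) | (ages > high)]
print("fences:", low, high)
print("outliers:", outliers.tolist())
```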

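6. Based on the box plots, outliers are handled differently for different features. For Age, negative values and values above 100 are set to null. Considering monthly and annual income together, annual income above 200000 is set to null. Ranges are set for the other variables in the same way. The values are set to null, and not removed, mainly because one customer (Customer_ID) has several business records (ID), so the gaps can be filled from other records with the same customer ID. The outliers are handled in SQL.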

select case when Age <0 or Age >100 then null  else Age end as Age,
         case when Annual_Income >200000 then null else Annual_Income end as Annual_Income,
         case when Interest_Rate>200 then null else Interest_Rate end as Interest_Rate_1,
         case when Num_Credit_Card>20 then null else Num_Credit_Card end as Num_Credit_Card,
         case when Num_Bank_Accounts<0 or Num_Bank_Accounts> 20 then null else Num_Bank_Accounts
         end as Num_Bank_Accounts_1,
         case when Num_Credit_Inquiries>50 then null else Num_Credit_Inquiries end as Num_Credit_Inquiries
from train;
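A rough pandas version of the same idea is sketched below: values outside a plausible range become NaN so that they can be filled later. The ranges are copied from the SQL above; the function name null_out_of_range and the toy rows are my own illustration, not the code I ran.

```python
import pandas as pd

# Plausible ranges copied from the SQL above; None means "no bound on that side".
valid_ranges = {
    "Age": (0, 100),
    "Annual_Income": (None, 200000),
    "Interest_Rate": (None, 200),
    "Num_Credit_Card": (None, 20),
    "Num_Bank_Accounts": (0, 20),
    "Num_Credit_Inquiries": (None, 50),
}

def null_out_of_range(frame: pd.DataFrame, ranges: dict) -> pd.DataFrame:
    """Hypothetical helper: replace out-of-range values with NaN, column by column."""
    out = frame.copy()
    for col, (low, high) in ranges.items():
        if col not in out.columns:
            continue
        values = out[col].astype(float)
        bad = pd.Series(False, index=out.index)
        if low is not None:
            bad |= values < low
        if high is not None:
            bad |= values > high
        out[col] = values.mask(bad)  # keep in-range values, NaN elsewhere
    return out

# Made-up rows with a few impossible values.
toy = pd.DataFrame({"Age": [28, -500, 7580, 41], "Num_Bank_Accounts": [3, -1, 1798, 6]})
result = null_out_of_range(toy, valid_ranges)
print(result)
```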

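7. First check how much is missing after the previous step. The plan for the next step: fill the missing Credit_Mix values with a random forest, and fill Age, Monthly_Inhand_Salary, Num_of_Loan, Credit_History_Age, Amount_invested_monthly, Num_of_Delayed_Payment, Changed_Credit_Limit, Num_Credit_Inquiries, Monthly_Balance and the like from the same customer's other non-missing records.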

## Missing rate of each feature
data = pd.read_csv(r"D:\new_job\KAGGLE\kaggle\train_1.csv")
missingDf = data.isnull().sum().sort_values(ascending=False).reset_index()
missingDf.columns = ['feature','missing']
missingDf['per-missing'] = missingDf['missing']/data.shape[0]
print(missingDf)

feature missing per-missing
0 Credit_Mix 20195 0.20195
1 Monthly_Inhand_Salary 15002 0.15002
2 Num_of_Loan 11408 0.11408
3 Credit_History_Age 9030 0.09030
4 Amount_invested_monthly 8784 0.08784
5 Num_of_Delayed_Payment 7002 0.07002
6 Changed_Credit_Limit 2091 0.02091
7 Num_Credit_Inquiries 1965 0.01965
8 Monthly_Balance 1200 0.01200
9 Credit_Utilization_Ratio 0 0.00000
10 Payment_Behaviour 0 0.00000
11 Total_EMI_per_month 0 0.00000
12 Payment_of_Min_Amount 0 0.00000
13 ID 0 0.00000
14 Outstanding_Debt 0 0.00000
15 Customer_ID 0 0.00000
16 Delay_from_due_date 0 0.00000
17 Interest_Rate 0 0.00000
18 Num_Credit_Card 0 0.00000
19 Num_Bank_Accounts 0 0.00000
20 Annual_Income 0 0.00000
21 Age 0 0.00000
22 Credit_Score 0 0.00000

8. Filling the values that were set to null: each gap is filled with the value from the next record (backfill).

'''
---- Fill the nulled-out values with the next record's value (backfill):
Age, Monthly_Inhand_Salary, Credit_History_Age, Amount_invested_monthly,
Num_of_Delayed_Payment, Changed_Credit_Limit, Num_Credit_Inquiries, Monthly_Balance, ...
Note: a plain bfill runs over the whole table, so if the rows are ordered by customer,
a gap in a customer's last row is filled from the next customer's first row.
'''
data['Num_of_Loan'] = data['Num_of_Loan'].fillna(0)  # no loans: fill with 0
bfill_cols = ['Age','Monthly_Inhand_Salary','Credit_History_Age','Amount_invested_monthly',
              'Num_of_Delayed_Payment','Changed_Credit_Limit','Num_Credit_Inquiries',
              'Monthly_Balance','Num_Credit_Card','Interest_Rate','Num_Bank_Accounts',
              'Annual_Income']
for col in bfill_cols:
    data[col] = data[col].bfill()  # assign back; avoids the deprecated fillna(method=..., inplace=True)
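Step 7 says the gaps should be filled from the same customer's other records, whereas the backfill above runs over the whole table. A minimal sketch of a per-customer fill is given below, assuming the rows carry a Customer_ID column. The function name fill_within_customer and the toy data are illustrative only; this is a variation, not the code that produced the results in this article.

```python
import numpy as np
import pandas as pd

# Made-up records for two customers, three rows each.
toy = pd.DataFrame({
    "Customer_ID": ["A", "A", "A", "B", "B", "B"],
    "Age": [23.0, np.nan, 23.0, np.nan, 41.0, np.nan],
    "Monthly_Inhand_Salary": [1800.0, 1800.0, np.nan, np.nan, np.nan, np.nan],
})

def fill_within_customer(frame, cols, key="Customer_ID"):
    """Hypothetical helper: backfill, then forward-fill, without crossing customers."""
    out = frame.copy()
    out[cols] = out.groupby(key)[cols].bfill()   # next record of the same customer
    out[cols] = out.groupby(key)[cols].ffill()   # trailing gaps: previous record
    return out

filled = fill_within_customer(toy, ["Age", "Monthly_Inhand_Salary"])
print(filled)
```

Customer B has no salary on any row, so those cells stay NaN instead of silently borrowing a value from another customer; they would need a separate rule, for example a median.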

9. Fill the missing Credit_Mix values with a random forest. This completes the basic data preparation.

'''
Handling the Credit_Mix feature:
--- rows where Credit_Mix is present form the training set; rows where it is missing form the set to predict
'''
from sklearn.ensemble import RandomForestRegressor
data = pd.read_csv(r"D:\new_job\KAGGLE\kaggle\train_2.csv")
rfDf = data.iloc[:,[8,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,20,21]]  ## Credit_Mix first, then the features without missing values
rfDf_train = rfDf.loc[rfDf['Credit_Mix'].notnull()]
rfDf_test = rfDf.loc[rfDf['Credit_Mix'].isnull()]

## Split into features and label
X = rfDf_train.iloc[:,1:]
y = rfDf_train.iloc[:,0]
## Training
rf = RandomForestRegressor(random_state=0,n_estimators=20,max_depth=3,n_jobs=-1)
rf.fit(X,y)
## Prediction: the regressor returns a value between 0 and 1, rounded to the nearer class
pred = rf.predict(rfDf_test.iloc[:,1:]).round()
data.loc[(data['Credit_Mix'].isnull()),'Credit_Mix'] = pred
print(data['Credit_Mix'].describe())
data.to_csv(r"D:\new_job\KAGGLE\kaggle\train_3.csv")

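II. Correlation analysis and feature selection

1. Compute Pearson correlation coefficients. For feature pairs with a correlation above 0.6, check the significance; a p-value below 0.05 is treated as significant. Of two features that are correlated above 0.6 with a significant relationship, keep only the one that discriminates better and is more stable.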

'''Compute the Pearson correlation coefficients'''
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

data = pd.read_csv(r"D:\new_job\KAGGLE\kaggle\train_3.csv")
cor = list(data.columns[3:])
pearson_mat = data[cor].corr(method='pearson')
plt.figure(figsize=(20,15))
sns.heatmap(pearson_mat,square=True,annot=True,cmap='YlGnBu')
plt.show()
'''
Check the significance of the highly correlated pairs
'''
print(data.columns)
pairs = [('Annual_Income','Monthly_Inhand_Salary'),  ## r=0.9744325673262497, p=0.0
         ('Credit_Mix','Outstanding_Debt'),
         ('Num_of_Loan','Outstanding_Debt'),
         ('Credit_Mix','Num_of_Loan'),
         ('Interest_Rate','Outstanding_Debt'),
         ('Interest_Rate','Credit_Mix'),
         ('Annual_Income','Amount_invested_monthly'),
         ('Monthly_Inhand_Salary','Amount_invested_monthly'),
         ('Credit_Mix','Delay_from_due_date')]
for a,b in pairs:
    r,p_value = pearsonr(data[a],data[b])
    print(a,b,r,p_value)
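The pairs above were read off the heat map by eye. A small sketch of pulling them out of the correlation matrix automatically is shown below. The function name high_corr_pairs and the synthetic data are assumptions for illustration only and have nothing to do with the real train_3.csv.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.6):
    """Return (col_a, col_b, r) for every pair with |Pearson r| above the threshold."""
    corr = frame.corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return sorted(pairs, key=lambda t: -abs(t[2]))

# Synthetic data: monthly salary is built from annual income, age is independent.
rng = np.random.default_rng(0)
income = rng.normal(50000, 15000, size=500)
toy = pd.DataFrame({
    "Annual_Income": income,
    "Monthly_Inhand_Salary": income / 12 + rng.normal(0, 100, size=500),
    "Age": rng.integers(18, 60, size=500).astype(float),
})
found = high_corr_pairs(toy)
print(found)
```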

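2. Check the variables for multicollinearity: use stepwise regression to filter out the variables that cause it. The columns that remain are ['Target', 'Credit_Utilization_Ratio', 'Num_of_Delayed_Payment', 'Credit_History_Age', 'Num_of_Loan', 'Age', 'Num_Credit_Card', 'Num_Bank_Accounts'], where 'Target' is the label. The rest of the analysis uses these retained columns.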

'''
Use stepwise regression to filter out the variables that cause multicollinearity
'''
import toad
final_data = toad.selection.stepwise(data,
                                     target=data.Target,
                                     estimator='lr',
                                     direction='both',
                                     criterion='aic',
                                     return_drop=False)
select_cols = final_data.columns
print(select_cols)
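Stepwise selection does not report how collinear each variable is. As a complementary check, and explicitly a different technique from the stepwise regression used in this article, the variance inflation factor (VIF) can be computed with plain numpy. The sketch below uses synthetic data; the function name vif and the usual "VIF above 10" reading are assumptions on my part.

```python
import numpy as np
import pandas as pd

def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on all the others."""
    X = frame.to_numpy(dtype=float)
    result = {}
    for j, name in enumerate(frame.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # ordinary least squares
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        result[name] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(result, name="VIF")

# Synthetic data: two nearly collinear income columns and one independent column.
rng = np.random.default_rng(0)
income = rng.normal(50000, 15000, size=500)
toy = pd.DataFrame({
    "Annual_Income": income,
    "Monthly_Inhand_Salary": income / 12 + rng.normal(0, 100, size=500),
    "Age": rng.integers(18, 60, size=500).astype(float),
})
v = vif(toy)
print(v)
```

The two income columns come out with very large VIFs, and Age stays close to 1.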

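III. Binning, WOE encoding, and IV calculation

1. First define a binning function, look at how each feature bins, check that the bins are monotonic, and find the bin edges. WOE turns out to be reasonably monotonic for every feature, so the bin edges found this way can be used.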

from scipy import stats  ## statistical functions (used for the Spearman correlation)

def optimal_bins(Y,X,n):
    '''
    :param Y: target variable
    :param X: feature to bin
    :param n: initial number of bins
    :return: summary table, list of bin edges, woe, iv
    '''
    r = 0
    total_bad =Y.sum() ## total number of bad samples
    total_good = Y.count()-total_bad ## total number of good samples
    ## Binning: reduce the number of bins until the bad rate is perfectly monotonic (|Spearman r| = 1)
    while np.abs(r)<1:
        df1 = pd.DataFrame({'X':X,'Y':Y,'bin':pd.qcut(X,n,duplicates='drop')})
        df2 = df1.groupby('bin')
        r,p = stats.spearmanr(df2.mean().X,df2.mean().Y)
        n=n-1
    ## Compute woe and iv
    df3 = pd.DataFrame()
    df3['min_'+X.name] = df2.min().X
    df3['max_'+X.name] = df2.max().X
    df3['sum'] = df2.sum().Y
    df3['total'] = df2.count().Y
    df3['rate'] = df2.mean().Y
    df3['badattr'] = df3['sum']/total_bad
    df3['goodattr'] = (df3['total']-df3['sum'])/total_good
    df3['woe'] = np.log(df3['badattr']/df3['goodattr'])
    iv = ((df3['badattr']-df3['goodattr'])*df3['woe']).sum()
    df3 = df3.sort_values(by='min_'+X.name).reset_index(drop=True)
    ## List of bin edges (n is now one less than the number of bins actually used)
    cut = []
    cut.append(float('-inf'))
    for i in range(1,n+1):
        qua = X.quantile(i/(n+1))
        cut.append(round(qua,6))
    cut.append(float('inf'))
    ## List of woe values
    woe = list(df3['woe'])
    return df3,cut,woe,iv
## Look at how each feature bins, for example 'Delay_from_due_date'
df_dfdd,cut_dfdd,woe_dfdd,iv_dfdd = optimal_bins(data.Target,data.Delay_from_due_date,n=10)
print(df_dfdd)

min_Delay_from_due_date max_Delay_from_due_date sum total rate \
0 -5 7 1959 17275 0.113401
1 8 11 1898 13020 0.145776
2 12 15 2173 13474 0.161274
3 16 21 4560 15265 0.298723
4 22 27 4550 14769 0.308078
5 28 38 5021 12434 0.403812
6 39 67 8837 13763 0.642084

badattr goodattr woe
0 0.067556 0.215712 -1.160983
1 0.065453 0.156643 -0.872643
2 0.074936 0.159165 -0.753301
3 0.157252 0.150770 0.042093
4 0.156907 0.143926 0.086360
5 0.173150 0.104406 0.505875
6 0.304745 0.069378 1.479901

2. Bin, WOE-encode, compute IV, and screen the variables. Credit_Utilization_Ratio has IV = 0.004521 and is dropped; all the other variables have an IV above 0.02.

'''Define the binning function'''
def custom_bins(Y,X,binlist):
    '''
    :param Y: target variable
    :param X: feature to bin
    :param binlist: list of bin edges
    :return: summary table, woe values, iv
    '''
    total_bad = Y.sum()  ## total number of bad samples
    total_good = Y.count() - total_bad  ## total number of good samples
    # Bin with the user-supplied edges
    df1 = pd.DataFrame({'X':X,'Y':Y,'bin':pd.cut(X,binlist)})
    df2 = df1.groupby('bin',as_index=True)
    r,p = stats.spearmanr(df2.mean().X,df2.mean().Y)  # monotonicity check (not used further)
    df3 = pd.DataFrame()
    df3['min_' + X.name] = df2.min().X
    df3['max_' + X.name] = df2.max().X
    df3['sum'] = df2.sum().Y
    df3['total'] = df2.count().Y
    df3['rate'] = df2.mean().Y
    df3['badattr'] = df3['sum'] / total_bad
    df3['goodattr'] = (df3['total'] - df3['sum']) / total_good
    df3['woe'] = np.log(df3['badattr'] / df3['goodattr'])
    iv = ((df3['badattr'] - df3['goodattr']) * df3['woe']).sum()
    df3 = df3.sort_values(by='min_' + X.name).reset_index(drop=True)
    woe = list(df3['woe'])

    return df3,woe,iv

'''The custom bin edges are as follows'''
# Original features
ninf = float('-inf')
pinf = float('inf')
cut_cur = [ninf, 28.052567, 32.305784, 36.496663, pinf]
cut_ndp = [ninf,6.0, 9.0, 11.0, 14.0, 16.0, 18.0, 21.0,pinf]
cut_cha = [ninf,12.0, 18.0, 25.0,pinf]
cut_nol = [ninf, 2.0, 3.0, 5.0,pinf]
cut_age = [ninf,24.0, 33.0, 42.0,pinf]
cut_ncc = [ninf,4.0, 5.0, 7.0,pinf]
cut_nba = [ninf,3.0, 5.0, 7.0,pinf]


## Summary table, woe and iv for each feature
df_cur,woe_cur,iv_cur = custom_bins(data.Target,data.Credit_Utilization_Ratio,cut_cur)
df_ndp,woe_ndp,iv_ndp = custom_bins(data.Target,data.Num_of_Delayed_Payment,cut_ndp)
df_cha,woe_cha,iv_cha = custom_bins(data.Target,data.Credit_History_Age,cut_cha)
df_nol,woe_nol,iv_nol = custom_bins(data.Target,data.Num_of_Loan,cut_nol)
# cut_ir is not defined above and Interest_Rate was not kept by the stepwise selection, so this call is disabled
# df_ir,woe_ir,iv_ir = custom_bins(data.Target,data.Interest_Rate,cut_ir)
df_age,woe_age,iv_age = custom_bins(data.Target,data.Age,cut_age)
df_ncc,woe_ncc,iv_ncc = custom_bins(data.Target,data.Num_Credit_Card,cut_ncc)
df_nba,woe_nba,iv_nba = custom_bins(data.Target,data.Num_Bank_Accounts,cut_nba)

'''WOE encoding'''
data['Credit_Utilization_Ratio'] = pd.cut(data['Credit_Utilization_Ratio'],bins=cut_cur,labels=woe_cur)
data['Num_of_Delayed_Payment'] = pd.cut(data['Num_of_Delayed_Payment'],bins=cut_ndp,labels=woe_ndp)
data['Credit_History_Age'] = pd.cut(data['Credit_History_Age'],bins=cut_cha,labels=woe_cha)
data['Num_of_Loan'] = pd.cut(data['Num_of_Loan'],bins=cut_nol,labels=woe_nol)
data['Age'] = pd.cut(data['Age'],bins=cut_age,labels=woe_age)
data['Num_Credit_Card'] = pd.cut(data['Num_Credit_Card'],bins=cut_ncc,labels=woe_ncc)
data['Num_Bank_Accounts'] = pd.cut(data['Num_Bank_Accounts'],bins=cut_nba,labels=woe_nba)


df = data[['Target','Credit_Utilization_Ratio','Num_of_Delayed_Payment','Credit_History_Age',
           'Num_of_Loan','Age','Num_Credit_Card','Num_Bank_Accounts']]



# df.to_csv(r"D:\new_job\KAGGLE\kaggle\train_4_woe.csv")


'''List the IV of each feature'''
df = pd.read_csv(r"D:\new_job\KAGGLE\kaggle\train_4_woe.csv")
ivDf = pd.DataFrame(columns=['feature','iv'])
feaList = list(df.columns[1:])
ivList = [iv_cur,iv_ndp,iv_cha,iv_nol,iv_age,iv_ncc,iv_nba]
for i,x in enumerate(feaList):
    ivDf.loc[i,'feature'] = x
    ivDf.loc[i,'iv'] = ivList[i]
ivDf = ivDf.sort_values(by='iv',ascending=False).reset_index(drop=True)
print(ivDf)
# ivDf.to_csv(r"D:\new_job\KAGGLE\archive\ivDf.csv")

IV. Training the model on the selected variables: AUC = 0.76, KS = 0.47

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc,roc_curve
from sklearn.model_selection import train_test_split
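'''
Default values of some important LogisticRegression parameters
penalty: type of regularization, default 'l2'; with solver='liblinear', 'l1' can also be chosen
tol: tolerance for stopping the iterations, default 1e-4
max_iter: maximum number of iterations, default 100
'''
# Split into training and test sets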
df = pd.read_csv(r"D:\new_job\KAGGLE\kaggle\train_4_woe.csv")
X = df.iloc[:,1:]
Y = df.iloc[:,0]
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state=0)

# Train the model
lr = LogisticRegression(random_state=0,solver='liblinear',class_weight={0:0.4,1:0.6},penalty='l1')
k_train = lr.fit(X_train,Y_train)

# Predict
Y_pred = k_train.predict(X_test)
Y_score = lr.decision_function(X_test)

# Evaluate the model
fpr1,tpr1,threshold = roc_curve(Y_test,Y_score)
auc_value = auc(fpr1,tpr1)

# Plot the ROC curve
plt.figure(figsize=(20,15))
plt.plot(fpr1, tpr1, color='darkorange',label='ROC curve (area = %0.2f)' % auc_value)
plt.plot([0, 1], [0, 1], color='navy',  linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC_curve')
plt.legend(loc="lower right")
plt.show()
print('AUC:',auc_value)

# Compute KS
fig, ax = plt.subplots(figsize=(20,20))  # set the size here; a later plt.figure() call would open a second, empty figure
ax.plot(1 - threshold, tpr1, label='tpr') # the KS curve is drawn in descending order of predicted score, hence the 1-threshold mirror
ax.plot(1 - threshold, fpr1, label='fpr')
ax.plot(1 - threshold, tpr1-fpr1,label='KS')
# Plot
ax.set_xlabel('score')
ax.set_title('KS Curve')
ax.set_ylim([0.0, 1.0])
legend = ax.legend(loc='upper left')
plt.show()
print('KS:',max(tpr1-fpr1))
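KS here is simply the largest gap between the cumulative bad rate (TPR) and the cumulative good rate (FPR) as the score threshold is lowered. To make that concrete without roc_curve, here is a minimal numpy sketch. The function name ks_statistic and the eight toy scores are made up, and it assumes both classes are present.

```python
import numpy as np

def ks_statistic(y_true, y_score) -> float:
    """max(TPR - FPR) over all score thresholds; y_true is 1 for bad, 0 for good."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="mergesort")   # highest score first
    y_sorted, s_sorted = y_true[order], y_score[order]
    cum_bad = np.cumsum(y_sorted == 1)
    cum_good = np.cumsum(y_sorted == 0)
    # Evaluate only at the last row of each group of tied scores.
    last = np.r_[s_sorted[1:] != s_sorted[:-1], True]
    tpr = cum_bad[last] / cum_bad[-1]
    fpr = cum_good[last] / cum_good[-1]
    return float(np.max(tpr - fpr))

# Toy labels and scores, already in descending score order.
y = [1, 1, 0, 1, 0, 0, 1, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ks_toy = ks_statistic(y, s)
print(ks_toy)
```

On real data this should agree with max(tpr1-fpr1) from roc_curve, since both take the maximum of the same TPR minus FPR curve.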

V. Result: the model reaches AUC = 0.76 and KS = 0.47.
