This article approaches credit-card fraud detection from a data analyst's perspective: given the data, how do we look for patterns, and which model should we choose to build an anti-fraud model? The focus is on the business workflow rather than the underlying algorithms; the next article will cover how to correct an extremely imbalanced dataset and how to tune model parameters.

1. Data source and project overview

The data comes from a project I found on Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud. The dataset for this example can be downloaded from that project page.

The dataset contains transactions made by European cardholders in September 2013. It covers two days of transactions, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: frauds account for only 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data are not available. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features not transformed by PCA are 'Time' and 'Amount'. 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

2. Prepare and take a first look at the dataset

```python
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns; plt.style.use('ggplot')
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from sklearn.manifold import TSNE

# Load and inspect the data
creditcard_data = pd.read_csv('./creditcard.csv')
creditcard_data.shape, creditcard_data.info()
```

```
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
...
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
((284807, 31), None)
```

```python
creditcard_data.describe()
creditcard_data.head()

# Look at the ratio of fraud to normal transactions
count_classes = pd.value_counts(creditcard_data['Class'], sort=True).sort_index()
print(count_classes)  # or inspect count_classes[0] and count_classes[1] individually
```

```
0    284315
1       492
Name: Class, dtype: int64
```

```python
count_classes.plot(kind='bar')
plt.show()
```

Here 0 means normal and 1 means fraud. The two classes are severely, extremely imbalanced; they are not even on the same order of magnitude.

3. Fraud and the distribution over time

```python
# Descriptive statistics for both classes, and their distribution over time
print('Normal')
print(creditcard_data.Time[creditcard_data.Class == 0].describe())
print('-' * 25)
print('Fraud')
print(creditcard_data.Time[creditcard_data.Class == 1].describe())
```

```
Normal
count    284315.000000
...
Name: Time, dtype: float64
-------------------------
Fraud
count       492.000000
mean      80746.806911
std       47835.365138
min         406.000000
25%       41241.500000
50%       75568.500000
75%      128483.000000
max      170348.000000
Name: Time, dtype: float64
```

```python
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
bins = 50
ax1.hist(creditcard_data.Time[creditcard_data.Class == 1], bins=bins)
ax1.set_title('Fraud', fontsize=22)
ax1.set_ylabel('Transactions', fontsize=15)
ax2.hist(creditcard_data.Time[creditcard_data.Class == 0], bins=bins)
ax2.set_title('Normal', fontsize=22)
plt.xlabel('Time (seconds)', fontsize=15)
plt.xticks(fontsize=15)
plt.ylabel('Transactions', fontsize=15)
# plt.yticks(fontsize=22)
plt.show()
```

Fraud shows no necessary relationship with time and no periodicity, while normal transactions have an obvious cycle with something like a bimodal pattern.

4. Fraud and the distribution of transaction amounts

```python
print('Fraud')
print(creditcard_data.Amount[creditcard_data.Class == 1].describe())
print('-' * 25)
print('Normal')
print(creditcard_data.Amount[creditcard_data.Class == 0].describe())
```

```
Fraud
count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64
-------------------------
Normal
count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64
```

```python
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
bins = 30
ax1.hist(creditcard_data.Amount[creditcard_data.Class == 1], bins=bins)
ax1.set_title('Fraud', fontsize=22)
ax1.set_ylabel('Transactions', fontsize=15)
ax2.hist(creditcard_data.Amount[creditcard_data.Class == 0], bins=bins)
ax2.set_title('Normal', fontsize=22)
plt.xlabel('Amount ($)', fontsize=15)
plt.xticks(fontsize=15)
plt.ylabel('Transactions', fontsize=15)
plt.yscale('log')
plt.show()
```

Fraudulent amounts are generally low, so the Amount column on its own offers limited value for the analysis.
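Since Amount is heavily right-skewed while the V1-V28 components already come out of PCA on a roughly comparable scale, it can help to standardize it before modeling. Below is a minimal sketch using the StandardScaler imported earlier; the column name normAmount is my own choice and is not part of the original notebook:

```python
from sklearn.preprocessing import StandardScaler

# Rescale Amount to zero mean and unit variance so it lives on a
# scale comparable to the PCA components V1-V28.
creditcard_data['normAmount'] = StandardScaler().fit_transform(
    creditcard_data['Amount'].values.reshape(-1, 1)).ravel()

# Compare the raw and standardized columns.
print(creditcard_data[['Amount', 'normAmount']].describe())
```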
5. Relationship between each feature (V1-V28) and the response

Check each independent variable against the dependent variable to see whether it is associated with normal or fraudulent transactions. For a more intuitive view, judge them one by one with distplot, as follows:

```python
v_features = [x for x in creditcard_data.columns if x not in ['Time', 'Amount', 'Class']]
plt.figure(figsize=(12, 28 * 4))
gs = gridspec.GridSpec(28, 1)
import warnings
warnings.filterwarnings('ignore')
for i, cn in enumerate(creditcard_data[v_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(creditcard_data[cn][creditcard_data.Class == 1], bins=50, color='red')
    sns.distplot(creditcard_data[cn][creditcard_data.Class == 0], bins=50, color='green')
    ax.set_xlabel('')
    ax.set_title('Histogram: ' + str(cn))
plt.savefig('variablesRelationshipwithclass.png', transparent=False, bbox_inches='tight')
plt.show()
```

Red indicates fraud and green indicates normal. The larger the overlap between the two distributions, the less the variable separates fraud from normal (for example V15); the smaller the overlap, the more influence the variable has on the response (for example V14).

Next, let's look at each individual variable's distribution. For a more intuitive view, plot the histogram matrix directly:

```python
# Histogram matrix of every variable
creditcard_data.hist(figsize=(15, 15), bins=50)
plt.show()
```

6. Three modeling approaches

This part applies three methods to build the model: logistic regression, random forest, and support vector machine (SVM).

Prepare the data:

```python
# First split the data into fraud and normal groups,
# then build proportional train and test sets
Fraud = creditcard_data[creditcard_data.Class == 1]
Normal = creditcard_data[creditcard_data.Class == 0]

# Training feature set
x_train = Fraud.sample(frac=0.7)
x_train = pd.concat([x_train, Normal.sample(frac=0.7)], axis=0)

# Test feature set
x_test = creditcard_data.loc[~creditcard_data.index.isin(x_train.index)]

# Label sets
y_train = x_train.Class
y_test = x_test.Class

# Drop the label and time columns from the feature sets
x_train = x_train.drop(['Class', 'Time'], axis=1)
x_test = x_test.drop(['Class', 'Time'], axis=1)

# Check the shapes
print(x_train.shape, y_train.shape, '\n', x_test.shape, y_test.shape)
```

```
(199364, 29) (199364,)
(85443, 29) (85443,)
```

6.1 Logistic regression

```python
from sklearn import metrics
import scipy.optimize as op
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import (precision_recall_curve, auc, roc_auc_score,
                             roc_curve, recall_score, classification_report)

lrmodel = LogisticRegression(penalty='l2')
lrmodel.fit(x_train, y_train)

# Inspect the model
print('lrmodel')
print(lrmodel)
```

```
lrmodel
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
```

```python
# Confusion matrix
ypred_lr = lrmodel.predict(x_test)
print('confusion_matrix')
print(metrics.confusion_matrix(y_test, ypred_lr))
```

```
confusion_matrix
[[85284    11]
 [   56    92]]
```

```python
# Classification report
print('classification_report')
print(metrics.classification_report(y_test, ypred_lr))
```

```
classification_report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85295
          1       0.89      0.62      0.73       148

avg / total       1.00      1.00      1.00     85443
```

```python
# Accuracy and area under the ROC curve
print('Accuracy: %f' % metrics.accuracy_score(y_test, ypred_lr))
print('Area under the curve: %f' % metrics.roc_auc_score(y_test, ypred_lr))
```

```
Accuracy: 0.999216
Area under the curve: 0.810746
```
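With classes this imbalanced, a precision-recall curve is often more informative than an ROC AUC computed from hard 0/1 predictions. Below is a minimal sketch of my own (not part of the original analysis) that scores the test set with predict_proba and uses the precision_recall_curve function already imported above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Probability of the fraud class for each test transaction,
# instead of a hard 0/1 label.
proba_lr = lrmodel.predict_proba(x_test)[:, 1]

# Trace precision and recall across all decision thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, proba_lr)

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve (logistic regression)')
plt.show()
```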
6.2 Random forest

```python
from sklearn.ensemble import RandomForestClassifier

rfmodel = RandomForestClassifier()
rfmodel.fit(x_train, y_train)

# Inspect the model
print('rfmodel')
rfmodel
```

```
rfmodel
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=False)
```

```python
# Confusion matrix
ypred_rf = rfmodel.predict(x_test)
print('confusion_matrix')
print(metrics.confusion_matrix(y_test, ypred_rf))
```

```
confusion_matrix
[[85291     4]
 [   34   114]]
```

```python
# Classification report
print('classification_report')
print(metrics.classification_report(y_test, ypred_rf))
```

```
classification_report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85295
          1       0.97      0.77      0.86       148

avg / total       1.00      1.00      1.00     85443
```

```python
# Accuracy and area under the ROC curve
print('Accuracy: %f' % metrics.accuracy_score(y_test, ypred_rf))
print('Area under the curve: %f' % metrics.roc_auc_score(y_test, ypred_rf))
```

```
Accuracy: 0.999625
Area under the curve: 0.902009
```

6.3 Support vector machine (SVM)

```python
# SVM classifier
from sklearn.svm import SVC

svcmodel = SVC(kernel='sigmoid')
svcmodel.fit(x_train, y_train)

# Inspect the model
print('svcmodel')
svcmodel
```

```
svcmodel
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
```

```python
# Confusion matrix
ypred_svc = svcmodel.predict(x_test)
print('confusion_matrix')
print(metrics.confusion_matrix(y_test, ypred_svc))
```

```
confusion_matrix
[[85197    98]
 [  142     6]]
```

```python
# Classification report
print('classification_report')
print(metrics.classification_report(y_test, ypred_svc))
```

```
classification_report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85295
          1       0.06      0.04      0.05       148

avg / total       1.00      1.00      1.00     85443
```

```python
# Accuracy and area under the ROC curve
print('Accuracy: %f' % metrics.accuracy_score(y_test, ypred_svc))
print('Area under the curve: %f' % metrics.roc_auc_score(y_test, ypred_svc))
```

```
Accuracy: 0.997191
Area under the curve: 0.519697
```

7. Summary

Judging by the performance of the three models, the random forest has the lowest false-alarm rate. Do not focus only on accuracy: a high accuracy does not necessarily mean a good model, especially with severely imbalanced data like this project's. For example, suppose we have a dataset of 1,000 patients, 990 healthy and 10 with cancer, and we need a model to find those 10 cancer patients. A model that predicts all 1,000 people as healthy finds none of the 10 patients, yet its accuracy is still 99%; such a model is useless because it fails our actual goal of finding the patients.

When modeling an extremely imbalanced dataset like this example, it makes sense to balance the data with methods such as undersampling or oversampling before making predictions; the next article will improve on this point. Also, models and algorithms are not inherently better or worse than one another; the difference is only how they perform in different situations, and that should be kept in perspective.
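As a preview of the rebalancing idea, here is a minimal random-undersampling sketch in pandas: it keeps all fraud rows and draws an equal number of normal rows to form a balanced set. The variable names are my own, and this is only one of several possible strategies (oversampling and SMOTE are alternatives covered next time):

```python
import pandas as pd

# Assume creditcard_data is the DataFrame loaded earlier.
fraud = creditcard_data[creditcard_data.Class == 1]
normal = creditcard_data[creditcard_data.Class == 0]

# Randomly draw as many normal rows as there are fraud rows (492),
# then shuffle the combined frame so the two classes are mixed.
balanced = pd.concat([fraud, normal.sample(n=len(fraud), random_state=42)])
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced.Class.value_counts())  # expect 492 rows of each class
```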