Personal Credit Scoring Modeling and Analysis Based on Data Mining
Published: 2019-04-11 07:22
[Abstract]: As the economy develops, more and more households need credit for housing, cars, education, and everyday consumption. Avoiding potential personal credit risk is therefore a major challenge for banks and other lending institutions. Building a personal credit loan model with statistical methods or data mining techniques, one that predicts the probability of individual default with reasonable accuracy, is of considerable value to these institutions. Predicting personal credit is essentially a classification problem: individual consumers are divided into those who repay principal and interest on schedule ("good" customers) and those who default ("bad" customers). For this problem, this thesis builds models with Logistic regression and decision tree classification, compares the strengths and weaknesses of the two, and selects the better model.

The empirical data come from a kaggle competition, and the analysis is carried out with SAS and SPSS. First, the original data are randomly sampled in SAS and divided into a training set, a validation set, and a test set. The data are then preprocessed: missing values and outliers are checked, multicollinearity is tested, and imputation and variable cluster analysis are used for variable screening. Of the ten variables x1-x10, five (x1, x2, x4, x8, x9) are retained for Logistic regression modeling. The full-model method of Logistic regression analysis yields three candidate models; after parameter estimation and model significance tests, two prediction models remain, both with an AUC of 0.714, which indicates reasonably good predictive performance. To choose a robust and parsimonious final model, ROC curves are drawn and AUC values computed on the validation set, where both models exceed 0.70. After an overall comparison, the final Logistic regression model is built on x2, x8, and x9. Next, a decision tree classification model is built on the training set in SPSS with the Exhaustive CHAID algorithm, which selects five variables (x1, x3, x4, x7, x9). Its robustness is checked on the validation set, giving an AUC of 0.839, which indicates good robustness. Finally, the two models are compared on the test set: the sum of squared errors between the predicted default probability p and the actual outcome is 823.298 for the Logistic regression model and 231.559 for the decision tree model. The decision tree model therefore outperforms the Logistic regression model in both prediction accuracy and robustness.
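The two comparison metrics named in the abstract, AUC and the sum of squared errors between predicted default probability and actual outcome, can be sketched as follows. This is a minimal illustration in plain Python, not the thesis's SAS/SPSS procedure. The function names, the label coding (1 = default, 0 = repaid on schedule), and the toy values are all assumptions made for the example and are not taken from the thesis data.

```python
def auc(labels, scores):
    """Pairwise AUC: the share of (defaulter, non-defaulter) pairs in which
    the defaulter gets the higher predicted probability; ties count as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sum_squared_error(labels, probs):
    """Sum over customers of (predicted default probability - actual outcome)^2."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs))


# Hypothetical toy data: 1 = default ("bad" customer), 0 = repaid ("good" customer).
toy_labels = [1, 0, 1, 0, 0]
toy_probs = [0.9, 0.2, 0.4, 0.5, 0.1]  # illustrative predicted default probabilities

toy_auc = auc(toy_labels, toy_probs)
toy_sse = sum_squared_error(toy_labels, toy_probs)
```

A higher AUC means the model ranks defaulters above non-defaulters more often, while a lower sum of squared errors means the predicted probabilities sit closer to the observed 0/1 outcomes. Because the sum grows with the number of test cases, it is only comparable between models scored on the same test set, as in the comparison the abstract reports.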
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13
Article ID: 2456202
Article link: http://sikaile.net/jingjilunwen/jiliangjingjilunwen/2456202.html