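A Comparative Study of Comprehensive Projection Pursuit Regression Analysis and Comprehensive Traditional Regression Analysis for Complex Data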
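Topic: projection pursuit + comprehensive traditional regression analysis. Source: master's thesis, Academy of Military Medical Sciences of the Chinese People's Liberation Army, 2017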
[Abstract]: Statistical analysis of high-dimensional data is increasingly common in medical research. High dimensionality poses problems for traditional multivariate statistical methods: the computational burden is heavy, the curse of dimensionality appears, and methods that are robust in low dimensions become less robust in high dimensions. Traditional methods fall far short of what high-dimensional data analysis requires; in particular, when the data are not normally distributed, multivariate methods built on the normality assumption are of little use. Against this background, projection pursuit emerged in the 1960s and 1970s. To analyze high-dimensional data, projection pursuit projects the data onto a low-dimensional space (1 to 3 dimensions) that reflects the structure or features of the original data, and uses a projection index to measure how much information the projected distribution contains. The key step is therefore to find the projection direction at which the projection index reaches its maximum or minimum; at present a genetic algorithm is most often used to search for the optimal direction. Combining projection pursuit with regression analysis yields projection pursuit regression. This study analyzes the same complex data sets with both projection pursuit regression and traditional regression methods and compares their fitting and prediction performance, in order to determine which approach suits such data better. The results make the scope of application of projection pursuit regression more concrete and may raise awareness of the method among medical statisticians, which should help in choosing an appropriate method for regression analysis of complex data.
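The direction search described above can be illustrated with a minimal sketch. It rests on assumptions that are not in the thesis: the projection index is taken to be the sample variance of the projected values, the data are two-dimensional and hypothetical, and a plain grid search over angles stands in for the genetic algorithm mentioned in the abstract.

```python
import math

def project(points, angle):
    """Project 2-D points onto the unit vector at the given angle (radians)."""
    ux, uy = math.cos(angle), math.sin(angle)
    return [x * ux + y * uy for x, y in points]

def variance_index(values):
    """Toy projection index (an assumption): sample variance of the projection."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def best_direction(points, index=variance_index, steps=180):
    """Grid search over directions in [0, pi) for the largest index value.

    This replaces the genetic algorithm with an exhaustive scan, which is
    only practical in very low dimensions.
    """
    angles = [math.pi * k / steps for k in range(steps)]
    return max(angles, key=lambda angle: index(project(points, angle)))

# Hypothetical data lying close to the line y = x.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
theta = best_direction(points)
direction = (math.cos(theta), math.sin(theta))
print("best angle (degrees):", round(math.degrees(theta), 1))
```

With a variance index the search recovers the first principal direction; other indices would reward other kinds of structure, and in higher dimensions a stochastic optimizer such as a genetic algorithm becomes the more realistic choice.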
The projection pursuit regression methods used in this thesis comprise those available in R (the three methods offered by PPR: Spline, Gcvspline, and Supsmu) and the method used in self-developed projection pursuit regression software (the Hermite polynomial method). The comprehensive traditional regression methods are multiple linear regression, principal component regression, ridge regression, partial least squares regression, and robust regression. "Complex data" is defined here as covering two cases. In the first case, the independent variables are multicollinear; the traditional methods applied are principal component regression, ridge regression, and partial least squares regression, computed with the REG, PRINCOMP, and PLS procedures in SAS. In the second case, the data contain outliers; the traditional method applied is robust regression, computed with the ROBUSTREG procedure in SAS. Besides these complex-data comparisons, the thesis also compares projection pursuit regression with traditional multiple linear regression on good-quality data (no multicollinearity or outliers, and multiple linear regression already fits and predicts well). Fitting performance is assessed mainly by the coefficient of determination and the mean absolute relative error; prediction performance is assessed mainly by the absolute relative error of each prediction sample and the mean squared prediction error. For the real data sets, the fitting samples are the original sample data.
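The evaluation measures named above can be written out as a short sketch. The abstract does not give the exact formulas, so the definitions below are the common textbook ones and should be read as assumptions (for example, the MSE here divides by the number of samples); the observed and fitted values are hypothetical.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def mean_abs_relative_error(observed, fitted):
    """Mean of |observed - fitted| / |observed| (observed values must be non-zero)."""
    return sum(abs((o - f) / o) for o, f in zip(observed, fitted)) / len(observed)

def mean_squared_error(observed, predicted):
    """Mean of squared prediction errors, dividing by the sample count."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical observed and fitted values, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
fitted = [11.0, 19.0, 30.0, 42.0]

print("R^2 :", r_squared(observed, fitted))
print("MARE:", mean_abs_relative_error(observed, fitted))
print("MSE :", mean_squared_error(observed, fitted))
```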
The prediction samples are six samples formed from summary statistics of each variable: the mean, maximum, minimum, median, first quartile, and third quartile. For good-quality real data, projection pursuit regression fitted and predicted better than multiple linear regression, although the difference was small. With projection pursuit regression, the coefficient of determination ranged from 0.9703 to 0.9988, the mean relative error from 0.0039 to 0.0187, and the prediction-sample MSE from 12.91 to 16.77; with multiple linear regression, the coefficient of determination was 0.9639, the mean relative error 0.0224, and the prediction-sample MSE 18.80. For good-quality simulated data, the two methods differed very little in fitting and prediction and could not be ranked; the coefficient of determination exceeded 0.9942 for both. Three real data sets with collinear independent variables were analyzed. For the first, the traditional methods (principal component regression, ridge regression, and partial least squares regression) gave coefficients of determination of 0.9351 to 0.9386 and mean relative errors of 0.0497 to 0.0528, with prediction-sample MSEs of 1.18 (principal component regression), 0.66 (ridge regression), and 1.14 (PLS regression); projection pursuit regression gave coefficients of determination of 0.9756 to 0.9846, mean relative errors of 0.0316 to 0.0363, and prediction-sample MSEs of 0.69 to 0.86.
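The six prediction samples mentioned at the start of the paragraph above could be built along the following lines. This is one reading of the abstract, not the author's code: each prediction sample is assumed to collect the same summary statistic across all variables, the quartile convention is whatever Python's `statistics.quantiles` uses by default, and the variable names and values are hypothetical.

```python
import statistics

def summary_samples(columns):
    """Build six pseudo-samples, one per summary statistic.

    columns maps a variable name to its list of observed values.
    The result maps a statistic name to a {variable: value} record.
    """
    samples = {}
    for name, values in columns.items():
        # quantiles(n=4) returns the three quartile cut points.
        q1, _, q3 = statistics.quantiles(values, n=4)
        stats = {
            "mean": statistics.mean(values),
            "max": max(values),
            "min": min(values),
            "median": statistics.median(values),
            "q1": q1,
            "q3": q3,
        }
        for stat_name, value in stats.items():
            samples.setdefault(stat_name, {})[name] = value
    return samples

# Hypothetical data set with one predictor and one response.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
}
samples = summary_samples(columns)
for stat_name, record in samples.items():
    print(stat_name, record)
```

Each record can then be fed to a fitted model, with the response entry serving as the reference value for the prediction error.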
For the second collinear data set, the traditional methods (principal component regression, ridge regression, and partial least squares regression) gave coefficients of determination of 0.9039 to 0.9820 and mean relative errors of 0.0174 to 0.0383, with prediction-sample MSEs of 126.59 (principal component regression), 208.40 (ridge regression), and 215.82 (PLS regression); projection pursuit regression gave coefficients of determination of 0.9823 to 0.9927, mean relative errors of 0.0104 to 0.0175, and prediction-sample MSEs of 11.00 to 27.25. For the third collinear data set, the traditional methods gave coefficients of determination of 0.8023 to 0.8924 and mean relative errors of 0.0450 to 0.0642, with prediction-sample MSEs of 0.61 (principal component regression), 0.36 (ridge regression), and 0.23 (PLS regression); projection pursuit regression gave coefficients of determination of 0.8851 to 0.9980, mean relative errors of 0.0046 to 0.0481, and prediction-sample MSEs of 0.03 to 0.65. Two real data sets containing outliers were also analyzed. For the first, both projection pursuit regression and robust regression fitted the data poorly: the highest coefficient of determination among the traditional methods was 0.3641, and projection pursuit regression gave coefficients of determination of 0.1857 to 0.6650.
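To show what robust regression does with an outlier, here is a minimal sketch of M-estimation by iteratively reweighted least squares for a straight-line fit. It is not a reproduction of the SAS ROBUSTREG computation: the Huber weight function, the tuning constant 1.345, the MAD-based scale, and the data are all assumptions chosen for simplicity, since the abstract does not state which settings were used.

```python
import statistics

def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def huber_line(x, y, c=1.345, iterations=50):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    a, b = weighted_line(x, y, [1.0] * len(x))  # start from ordinary least squares
    for _ in range(iterations):
        residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = statistics.median([abs(r) for r in residuals]) / 0.6745
        if scale == 0:
            break
        # Huber weights: 1 for small residuals, shrinking for large ones.
        weights = [1.0 if abs(r) <= c * scale else c * scale / abs(r)
                   for r in residuals]
        a, b = weighted_line(x, y, weights)
    return a, b

# Hypothetical data: y = 1 + 2x with small noise and one gross outlier.
x = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, 0.0]
y = [1.0 + 2.0 * xi + e for xi, e in zip(x, noise)]
y[-1] += 40.0  # contaminate the last observation

ols = weighted_line(x, y, [1.0] * len(x))
robust = huber_line(x, y)
print("OLS slope   :", ols[1])
print("Huber slope :", robust[1])
```

The outlier sits at the largest x value, so it has high leverage and pulls the ordinary least-squares slope well away from 2; the reweighting reduces that pull.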
For the second data set with outliers, M regression gave a coefficient of determination of 0.8982, a mean relative error of 0.1377, and a prediction-sample MSE of 3.3919; projection pursuit regression gave coefficients of determination of 0.9423 to 0.9563, mean relative errors of 0.0899 to 0.1138, and prediction-sample MSEs of 2.3604 to 3.0308. The results support the following conclusions. (1) For good-quality data, multiple linear regression and projection pursuit regression fit almost equally well, with coefficients of determination above 0.95, and projection pursuit regression is computationally harder; multiple linear regression is therefore recommended for good-quality data. (2) When the data are collinear, projection pursuit regression can be considered better than the traditional treatments of collinearity (principal component regression, ridge regression, and partial least squares regression). (3) Tentatively, when the data contain outliers, projection pursuit regression performs better than robust regression. (4) Data quality is very important. Research design deserves attention, in particular identifying accurately and completely the independent variables that affect the outcome variable, and ensuring that the sample is large enough and sufficiently representative of the population. If important causal variables are ignored or omitted during data collection, later statistical analysis can hardly compensate.
[Degree-granting institution]: Academy of Military Medical Sciences of the Chinese People's Liberation Army
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: O212.1; R195.1