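A Study of the Collapsibility of High-Dimensional Contingency Tables in Categorical Data
Published: 2018-05-30 04:21
Topics: contingency table collapsing + Simpson's paradox. Source: Xiamen University, master's thesis, 2014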
[Abstract]: Statistical methods for categorical data are important tools for analyzing nominal and ordinal data. Within categorical data analysis, the contingency table is a common and intuitive device. For example, medical researchers build contingency tables by classifying cases by age and sex; education researchers classify students by age, sex, and family background; economic researchers classify the success or failure of enterprises by industry, region, and initial investment; and market researchers classify consumers by age, sex, and propensity to buy a product.

Traditional categorical data analysis consists mainly of tests of independence on contingency tables. Since the introduction and wide adoption of the log-linear model, categorical methods have often been applied to high-dimensional contingency tables, yet the literature in China and abroad lacks detailed methods for analyzing such tables. Because high-dimensional tables are complex, analyzing the associations among their variables usually requires reducing the dimension of the table, that is, collapsing the table over some of its variables. Inappropriate collapsing, however, can produce three phenomena: Simpson's paradox, spurious association, and spurious independence. These make the table harder to analyze, so methods for studying collapsibility are very important. Researchers have already studied three-way tables to some extent, but work on the collapsibility of high-dimensional tables is still lacking.

This thesis studies the collapsibility of contingency tables from three perspectives (interaction, mutual information, and information entropy) and examines in depth the conditions under which high-dimensional tables can be collapsed and how the collapsing can be carried out. The findings are:

1. A three-way contingency table is collapsible as long as conditional independence holds among its variables. For a four-way table, conditional independence among the variables does not guarantee collapsibility.
2. There are equivalence conditions between the interaction-based log-linear model and the mutual-information-based linear information model, so results obtained under either model can be used for the other.
3. Model selection methods are given for the linear information model both with and without a designated conditioning variable. The fitted linear information model is found to be more parsimonious than the log-linear model, and a table that the interaction-based model shows to be non-collapsible can be shown to be collapsible by the mutual-information-based model.
4. Methods for collapsing table variables based on mutual information and on information entropy are given. The mutual-information-based method collapses the table from the standpoint of the association among variables and allows some non-significant association information to be lost in the process. The entropy-based method collapses the table according to how much uncertainty each variable carries and allows no information about the variables to be lost.
5. Two methods for ranking the importance of table variables are given, one based on mutual information and one based on information entropy. From the standpoint of collapsibility, the mutual-information-based ranking is more accurate; from the standpoint of how much uncertainty a variable carries, the entropy-based ranking is more accurate.

These results contribute to the further development of categorical data analysis and provide several practicable approaches to the collapsibility of high-dimensional contingency tables.
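The risk that the abstract attributes to unreasonable collapsing (compressing) of a table can be illustrated with a minimal Python sketch. It is not taken from the thesis: the counts are made up for illustration, and the function names (`collapse_over_z`, `odds_ratio`, `mutual_information`) are hypothetical helpers. The sketch sums a 2 x 2 x 2 table over its third variable, compares the within-stratum odds ratios with the odds ratio of the collapsed table, and computes the empirical mutual information of the collapsed table.

```python
from math import log

# Made-up counts for illustration only (not data from the thesis).
# counts[z][x][y]: z = stratum, x = group, y = outcome (0 = success, 1 = failure).
counts = [
    [[9, 1], [80, 20]],   # stratum z = 0
    [[30, 70], [2, 8]],   # stratum z = 1
]


def collapse_over_z(table):
    """Sum a Z x X x Y table over Z, giving the marginal X x Y table."""
    n_x, n_y = len(table[0]), len(table[0][0])
    return [[sum(layer[x][y] for layer in table) for y in range(n_y)]
            for x in range(n_x)]


def odds_ratio(t):
    """Odds ratio (a * d) / (b * c) of a 2 x 2 table [[a, b], [c, d]]."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])


def mutual_information(t):
    """Empirical mutual information (in nats) between rows and columns."""
    total = sum(sum(row) for row in t)
    row_sums = [sum(row) for row in t]
    col_sums = [sum(col) for col in zip(*t)]
    mi = 0.0
    for i, row in enumerate(t):
        for j, n in enumerate(row):
            if n > 0:  # empty cells contribute nothing to the sum
                mi += (n / total) * log(n * total / (row_sums[i] * col_sums[j]))
    return mi


marginal = collapse_over_z(counts)
conditional_ors = [odds_ratio(layer) for layer in counts]
marginal_or = odds_ratio(marginal)

print("collapsed table:", marginal)
print("within-stratum odds ratios:", conditional_ors)
print("odds ratio after collapsing:", marginal_or)
print("mutual information after collapsing:", mutual_information(marginal))
```

With these counts, both within-stratum odds ratios are above 1 while the odds ratio of the collapsed table is below 1, so collapsing over the stratum variable reverses the direction of the association. This is the Simpson's paradox pattern, and it shows why a table cannot be collapsed over an arbitrary variable.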
[Degree-granting institution]: Xiamen University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: C81
Article ID: 1953902
Article link: http://sikaile.net/shekelunwen/shgj/1953902.html