A Study of the Collapsibility of High-Dimensional Contingency Tables in Categorical Data

Published: 2018-05-30 04:21

Topics: contingency table collapsing + Simpson's paradox; Source: Xiamen University, master's thesis, 2014

[Abstract]: Statistical methods for categorical data are important tools for analyzing nominal and ordinal data, and contingency tables offer a common and intuitive way to examine such data. For example, medical researchers build contingency tables by classifying cases by age and sex; education researchers classify students by age, sex, and family background; economic researchers classify business success or failure by industry, region, and initial investment; and market researchers classify consumers by age, sex, and propensity to buy a product. Traditional categorical data analysis mainly tests independence in contingency tables. Since log-linear models were introduced and widely adopted, categorical methods have often been applied to high-dimensional contingency tables, yet the domestic and international literature lacks detailed methods for analyzing such tables. Because high-dimensional tables are complex, analyzing the associations among their variables usually requires reducing the dimension of the table, that is, collapsing (compressing) the table over some of its variables. Improper collapsing, however, can produce three phenomena, namely Simpson's paradox, spurious association, and spurious independence, all of which make the table harder to analyze. Methods for assessing collapsibility are therefore important. Three-way contingency tables have received some study, but research on the collapsibility of high-dimensional tables is still lacking. This thesis analyzes collapsibility from three perspectives (interaction, mutual information, and information entropy) and examines in depth the conditions under which high-dimensional contingency tables are collapsible and how collapsing can be carried out. The main findings are:

1. A three-way contingency table is collapsible whenever conditional independence holds among its variables, but for a four-way table conditional independence among the variables does not guarantee collapsibility.
2. Equivalence conditions exist between the interaction-based log-linear model and the mutual-information-based linear information model, so the results of either model can inform the other.
3. Model selection methods are given for linear information models with and without a designated conditioning variable. The fitted linear information model is found to be more parsimonious than the log-linear model: the interaction-based model indicates that the table is not collapsible, whereas the mutual-information-based model indicates that it is.
4. Methods for collapsing contingency table variables based on mutual information and on information entropy are given. The mutual-information method collapses the table with regard to the association among variables and allows some non-significant association information to be lost in the process. The entropy method collapses the table according to how much uncertainty each variable carries and allows no loss of any variable's information.
5. Two methods for ranking the importance of contingency table variables are given, one based on mutual information and one based on information entropy. From the standpoint of collapsibility, the mutual-information ranking is more accurate; from the standpoint of how much uncertainty a variable carries, the entropy ranking is more accurate.

These results contribute to the further development of categorical data analysis and provide several practical approaches to assessing the collapsibility of high-dimensional contingency tables.
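As a rough illustration of why improper collapsing (compressing) matters, the sketch below collapses a 2×2×2 contingency table over its stratifying variable and compares the conditional and marginal associations. It is not the thesis's method: the counts are illustrative and do not come from the thesis, and the helper names (`odds_ratio`, `collapse`, `mutual_information`) are hypothetical. The odds ratio stands in for an interaction-based measure and the mutual information for an information-based one.

```python
import math

# Illustrative counts only; they are NOT taken from the thesis.
# counts[z] is a 2x2 table for stratum z:
# rows = treatment (A, B), columns = outcome (success, failure).
counts = {
    "small": [[81, 6], [234, 36]],
    "large": [[192, 71], [55, 25]],
}


def odds_ratio(table):
    """Odds ratio of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def collapse(tables):
    """Collapse over the stratifying variable by summing the 2x2 tables."""
    out = [[0, 0], [0, 0]]
    for t in tables:
        for i in range(2):
            for j in range(2):
                out[i][j] += t[i][j]
    return out


def mutual_information(table):
    """Mutual information (in nats) between the rows and columns of a 2x2 table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [table[0][j] + table[1][j] for j in range(2)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if table[i][j] > 0:
                p = table[i][j] / n
                mi += p * math.log(p * n * n / (row_tot[i] * col_tot[j]))
    return mi


# Association within each stratum versus association after collapsing.
conditional_ors = {z: odds_ratio(t) for z, t in counts.items()}
marginal_table = collapse(counts.values())
marginal_or = odds_ratio(marginal_table)

# Conditional mutual information I(X; Y | Z) as a weighted average over strata.
total = sum(sum(map(sum, t)) for t in counts.values())
conditional_mi = sum(
    (sum(map(sum, t)) / total) * mutual_information(t) for t in counts.values()
)
marginal_mi = mutual_information(marginal_table)

print("conditional odds ratios:", conditional_ors)
print("marginal odds ratio:", marginal_or)
print("I(X;Y|Z):", conditional_mi, " I(X;Y):", marginal_mi)
```

With these counts, treatment A has the higher success odds within each stratum (both conditional odds ratios exceed 1), yet the collapsed table points the other way (marginal odds ratio below 1). That reversal is Simpson's paradox, so this table should not be collapsed over the stratifying variable when studying the treatment and outcome association.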
[Degree-granting institution]: Xiamen University
[Degree level]: Master's
[Year conferred]: 2014
[Classification number]: C81
